Abstract

This  report  presents  a  method  for  automatic  synthesis  of  asynchronous  digital  systems 
from high-level data flow specifications. We present an extended data flow model that accurately
reflects the behavior of the asynchronous components so that the data flow specification
can be directly mapped into a hardware realization. In addition, we develop a
timing  model  for  the  basic  asynchronous  building  blocks  and  show  how  to  derive  the  timing 
parameters  of  a  composed  system.  This  timing  model  can  also  be  used  at  the  data  flow 
level,  allowing  designers  to  explore  various  design  alternatives.  We  then  describe  a  number 
of  applications  of  the  data  flow  specification  for  high-level  synthesis  such  as  schemes  for 
resource sharing, local transformations for data flow description optimization, and allocation
and sequencing of operations for given resources. Finally, we present two examples, a
16-bit multiplier and a 16-point FIR digital filter, where the number of modules has been
altered at the data flow level using this synthesis method. The effectiveness of the data flow
specification and performance analysis is demonstrated by the areas and the back-annotated
simulation of actual layouts generated using an industrial standard cell library
and commercial CAD tools.


1  Introduction 

This  paper  presents  a  method  for  the  automatic  synthesis  of  asynchronous  digital  systems. 
The  input  is  a  data  flow  specification  of  the  system’s  behavior  and  the  result  is  a  design  with 
sufficient  detail  to  permit  fabrication.  Existing  approaches  focus  primarily  on  the  synthesis 
of asynchronous control circuits. Our work emphasizes asynchronous systems such as microprocessors
or special purpose processors used in image and signal processing applications,
where there is an enormous potential for concurrent computations at the function level. Our
approach uses a data-driven model to describe the functional behavior of asynchronous systems.
In this model, the data flow specification frees the designer from having to identify
concurrent  activities  and  their  synchronization  explicitly,  thus  allowing  complete  exploitation 
of  concurrency.  More  importantly,  our  data-driven  model  allows  the  various  system  measures 
such  as  delay  and  area  to  be  incorporated  in  the  high-level  specification.  This  enables  the 
designer to rapidly explore many design alternatives at the data flow level, examining the
tradeoff between performance and area. Furthermore, these high-level design decisions can
be replaced by design automation algorithms, namely, high-level synthesis [11, 18].

There are three main aspects of a synthesis system: the specification, the realization, and
the  methods.  The  specification  deals  with  developing  a  suitable  representation  of  the  abstract 
behavior.  The  realization  is  a  representation  of  the  system  in  terms  of  a  set  of  interconnected 
components.  The  methods  are  a  collection  of  techniques  that  translate  a  specification  to  a 
realization. 

We  start  by  describing  the  components  of  the  realization.  That  is,  we  first  describe 
the basic building blocks of the asynchronous system. The data communication of these
building  blocks  is  controlled  by  the  handshaking  protocol.  We  then  show  that  the  handshaking 
protocol  can  be  accurately  described  by  the  token  of  a  data  flow  model.  Next  we  describe  the 
abstract  specification.  Here  we  present  the  classical  data  flow  graph  (DFG)  representation 
as  the  behavior  specification  of  asynchronous  systems  in  our  synthesis  method.  With  the 
consideration  of  hardware  (register)  cost,  we  derive  an  extended  data  flow  graph  (EDFG) 
based on the same token-handshaking model. The structure of the EDFG is very similar to
the structure of the DFG. However, the semantics of the EDFG are defined to comply with
the behavior of the asynchronous circuits without registers. We also show that each function
node in a DFG corresponds to a composition of register nodes and a non-registered node
in  an  EDFG.  Thus  the  EDFG  provides  a  bridge  between  an  abstract  specification  and  the 
implementation.  After  having  discussed  the  two  ends  of  the  synthesis  system,  we  describe 
the  methods  for  translating  an  EDFG  to  a  realization.  This  is  done  with  respect  to  a  given 
library  of  components.  Then  we  develop  a  timing  model  for  the  basic  building  blocks  and 
show  how  to  derive  the  timing  parameters  for  any  composition  of  the  building  blocks.  This 
timing  model  is  applied  to  the  data  flow  representation.  After  defining  the  input  specification 
and  its  timed  behavior  model,  we  present  several  synthesis  methods  at  data  flow  level  to  help 
designers  explore  different  design  alternatives.  Sharing  schemes  are  templates  in  a  DFG  to 
share  a  resource  or  resources  by  the  same  type  of  operations.  Local  transformations  perform 
peephole  optimization/reduction  in  a  DFG  specification.  According  to  sharing  schemes  and 
their  performance/area  effects,  allocation  and  sequencing  algorithms  allocate  operations  in  a 
DFG  to  a  given  set  of  modules  and  order  the  sequence  of  operations  which  share  a  common 
module.  By  applying  these  synthesis  methods,  designers  may  obtain  various  designs  with 
the objective of minimizing delay or area, or of maximizing throughput.

Our  design  procedure  of  asynchronous  systems  is  shown  in  Figure  1.  This  paper  focuses 
on  the  top  half  of  this  design  procedure,  especially  the  data  flow  specification,  its  timed 
behavior  model,  and  its  relation  to  the  realization  and  the  synthesis  methods.  The  details 
of realizations and synthesis algorithms are deferred to future papers. Our layout implementation
relies on the MOSIS netlist-to-parts service [35, 36]. This paper is organized as
follows.  Section  2  reviews  related  work.  Section  3  identifies  the  basic  building  blocks  of  the 
realization  and  maps  their  behavior  into  a  data  flow  model.  Section  4  describes  the  data 
flow  specification  DFG  and  the  extended  specification  EDFG.  Section  5  constructs  a  timing 
model  for  the  building  blocks  and  the  function  node  for  the  DFG/EDFG.  By  using  this 
timing model, designers are able to analyze the system performance and other system measures
at the data flow level. Section 6 presents several synthesis methods using the DFG/EDFG
specification, whose effects in terms of performance and area can be easily determined.
Section 7 presents two detailed examples to demonstrate our design method. This includes
the DFG specification, various design alternatives that are possible with the given DFG,
and the mapping of the various designs to obtain a number of realizations. These design alternatives
are implemented with a standard cell library and their performance and area costs are
presented. The last section presents a conclusion for this work.


Figure 1: Overview of our design system.


2  Background 

Much  of  the  classical  work  done  in  asynchronous  design  has  focused  primarily  on  gate  level 
control circuits. Methods for realizing such circuits are based on the Huffman model [10, 17]
of  a  finite  state  machine  (FSM).  Such  an  approach  is  practical  only  for  relatively  small 
circuits.  Moreover,  the  FSM  model  cannot  describe  concurrent  behavior  at  any  higher  level. 

During  the  past  five  years  there  has  been  a  tremendous  resurgence  of  interest  in  the  design 
of large scale asynchronous systems [16] and more recently in the automatic synthesis of such
systems [5, 6, 21]. An important aspect of certain types of asynchronous designs is that they
make it feasible to carry out large system design in a truly modular fashion by composing
independently-designed  components  and  ensuring  correctness  by  construction  [26,  30].  In  the 
area  of  asynchronous  design  there  appear  to  be  two  approaches.  One  approach  focuses  on  the 
design  of  reliable  asynchronous  circuits,  e.g.,  hazard-free  asynchronous  circuits  and  delay- 
insensitive  circuits.  These  methods  are  based  on  the  manipulation  of  formal  specifications 
such as signal transition graphs (STG) and Petri nets [7, 14, 15, 21, 23]. The other approach
focuses on the synthesis of asynchronous systems by the interconnection of pre-defined asynchronous
modules. These methods attempt to translate a high-level language specification
such as CSP, CSP-like descriptions, OCCAM, or trace structures [1, 5, 6, 9, 30] into a
realization. The main task in these synthesis approaches is to correctly decompose/refine
the given behavior description into atomic constructs, which have corresponding pre-defined
asynchronous modules. However, much of the work done on the decomposition of asynchronous systems
has mainly focused on the synthesis of control circuits. There appears to be little work
done on the synthesis of both control and data paths of asynchronous systems. In particular,
problems related to the incorporation of system level performance measures in the high-level
specification  and  the  synthesis  of  asynchronous  systems  that  take  into  account  constraints 
on  the  availability  of  resources  have  not  been  dealt  with  adequately.  It  is  necessary  for  the 
designer  to  explore  these  various  design  alternatives.  The  goal  of  our  research  is  to  tackle 
these  design  issues  at  system  level. 

Our approach resembles those presented in [13, 19, 20] in the design
specification  and  the  mapping  method.  However,  our  approach  is  different  from  theirs  in  the 
following  sense.  Their  basic  modules  are  synchronous  circuits.  In  their  approach,  each  node 
in  the  data  flow  graph  maps  to  a  unique  hardware  module.  Due  to  the  physical  limitation 
of  VLSI,  they  implement  module  selection  techniques  to  reduce  the  area  of  a  design.  They 
partition  the  nodes  of  a  data  flow  graph  into  multiple  groups  so  that  each  group  can  be 
implemented on a chip. Our basic modules are instead asynchronous circuits. In our model,
a  token  in  a  DFG  represents  not  only  data  but  also  the  synchronization  state  between 
modules,  i.e.,  the  state  of  handshaking  signals  between  modules.  Despite  the  difference  of  the 
basic  circuit  models  between  their  approach  and  our  approach,  the  ideas  of  module  selection 
and  system  partition  are  important  and  applicable  in  our  design  procedure.  However,  we 
emphasize  the  module  utilization  more  in  this  research.  In  order  to  satisfy  constraints  on 
area  and/or  performance  we  develop  techniques  for  scheduling  and  allocation  over  the  nodes 
in  a  DFG.  Therefore,  a  node  representing  an  operator  (or  a  hardware  module)  in  the  final 
EDFG  may  correspond  to  more  than  one  node  representing  an  operation  (or  a  computation) 
in  the  original  DFG  specification. 

3  Hardware  Implementation  and  Data  Flow  Model 

The  hardware  model  that  is  employed  here  is  based  on  Sutherland’s  Micropipelines  [28].  This 
model assumes that request signals are bundled with the data signals to ensure proper operation,
namely, the bundled data convention. Unlike speed-independent and delay-insensitive
designs, the micropipeline model requires determination of the delays in the computational
blocks. This does not pose any serious problem, as this can be done in a manner similar to
the conventional design of synchronous systems.

Figure 2: Data transfer and handshaking protocol. (a) Data transfer between two blocks. (b) Two-phase handshaking. (c) Four-phase handshaking.

A  system  consists  of  a  collection  of  functional  blocks  with  data  transfers  taking  place 
between  two  or  more  functional  blocks  or  between  a  functional  block  and  the  surrounding 
environment.  Data  transfers  between  any  two  blocks  rely  on  a  handshaking  protocol.  Each 
block  will  be  activated  whenever  its  input  data  is  available.  Therefore,  the  operations  of 
functional  blocks  in  micropipelines  are  asynchronous,  concurrent,  and  data-driven. 

3.1  Data  Transfers  and  Handshaking  Protocols 

The  handshaking  protocol  used  in  our  design  method  can  be  a  two-phase  and/or  a  four-phase 
handshaking  protocol.  These  are  shown  in  Figure  2. 

Our  current  implementation  follows  the  two-phase  handshaking  protocol  with  the  bundled 
data  convention  [26,  28].  Referring  to  Figure  2(b)  we  see  that  there  are  three  events  in  each 
cycle  of  data  transfer.  First,  valid  data  is  put  on  the  data  bus  by  the  sender.  Second,  a  signal 
transition  is  activated  on  the  request  line  by  the  sender  to  notify  the  receiver  that  data  is 
available.  Third,  a  signal  transition  is  activated  on  the  acknowledge  line  by  the  receiver  to 
notify  the  sender  that  the  data  has  been  received  so  that  another  cycle  of  the  data  transfer 
can  begin. 
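
The three events of one two-phase cycle can be sketched in a few lines of Python. This is only an illustrative model, not part of the synthesis system: the `Channel` class and the `two_phase_transfer` function are hypothetical names, and circuit delays are ignored.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical bundled-data channel: a data bus plus request and acknowledge wires."""
    data: object = None
    req: int = 0  # both control wires start at 0
    ack: int = 0


def two_phase_transfer(ch, value):
    """Run one two-phase transfer cycle; return the received value and the event trace."""
    events = []
    ch.data = value              # event 1: the sender puts valid data on the bus
    events.append(("data", value))
    ch.req ^= 1                  # event 2: a transition (either direction) on the request line
    events.append(("req", ch.req))
    received = ch.data           # the receiver reads the bus while the data is still valid
    ch.ack ^= 1                  # event 3: a transition on the acknowledge line
    events.append(("ack", ch.ack))
    return received, events
```

Because two-phase signaling uses transitions rather than levels, a second call on the same channel toggles both wires back to 0 and still counts as a complete cycle.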

For the four-phase handshaking protocol shown in Figure 2(c), the request line and the
acknowledge line are initialized to 0 at the start of each cycle of data transfer. The first three
events are the same as those in the two-phase handshaking protocol. The next two events
are that the request is reset to 0 by the sender and that the acknowledge is reset to 0 by the
receiver. In terms of the period during which the data is valid, there are two conventions for the
four-phase handshaking protocol [4]. In the narrow convention, the sender holds the data valid
from the rising request signal to the rising acknowledge signal. In the broad convention, the
sender holds the data valid from the rising request signal to the falling acknowledge signal.
Although our current implementation follows the two-phase handshaking protocol with the
bundled data convention, a system may contain different protocols for different data transfers
in its implementation, as long as the sender and the receiver of each data transfer follow the
same protocol.

Figure 3: The behavior of the Muller C-element.

3.2  Realization  of  a  Basic  Block 

A functional block as proposed by Sutherland [28] has the structure shown in Figure 5(a).
There are three basic elements in this structure. The Muller C-element, represented by a “C”
gate, is used to control the handshaking protocol. The asynchronous register, represented by
a “reg” block, is used to capture and pass input data. The computational part, represented
by  a  “Logic”  block,  is  used  to  perform  the  functional  computation  for  the  structure,  e.g., 
addition.  The  oval  node  in  this  structure  represents  an  added  delay,  which  is  used  to  ensure 
that  the  output  data  transfer  satisfies  the  bundled  data  convention. 

Muller  C-element  There  are  two  inputs  and  one  output  for  a  Muller  C-element.  The 
output  of  a  C-element  is  1  if  all  the  inputs  are  1,  and  it  is  0  if  all  the  inputs  are  0;  otherwise 
its value remains unchanged [26]. A two-input C-element can be viewed as a logical AND of
two events, where an event can be a 0-1 or a 1-0 transition [28]. This behavior is shown in
Figure 3.
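
The C-element rule stated above can be captured by a small behavioral model. The sketch below is an assumption-laden illustration (the class name `CElement` and its `update` method are hypothetical, and gate delay is ignored): the output follows the inputs when they agree and holds its previous value otherwise.

```python
class CElement:
    """Hypothetical behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Apply input levels a and b (0 or 1) and return the resulting output."""
        if a == b:
            self.out = a  # both inputs agree: the output follows them
        return self.out   # otherwise the output keeps its previous value
```

Starting from output 0, a transition on one input alone leaves the output unchanged; only after the second input makes the matching transition does the output change, which is the "AND of two events" view.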

Asynchronous register The asynchronous register proposed by Sutherland is defined as
follows. There are four terminals for the register. “Di” and “Do” are the data input terminal
and the data output terminal of the register; “c” and “p”, which are called capture and pass
respectively, are two one-bit control signals of the register. If the value of “c” equals the
value of “p”, then the value of “Di” is passed to “Do”; otherwise the value of “Do” remains
unchanged. Operationally, “c” and “p” are initialized to 0, then signal transitions (events)
occur at “c” and “p” in the sequence cpcp.... This behavior is shown in Figure 4. In
the operation of the asynchronous register, event “c” always captures input data so that the
output holds the last input value before event “c”, e.g., Di1' and Di3' in Figure 4 are captured
by “c” events; event “p” starts the passing mode of the register, e.g., Di1, Di3, and Di5 in
Figure 4 are passed to “Do”.

Footnote: The broad convention and the narrow convention of the four-phase handshaking protocol are two different
protocols.

Footnote: Circuit delay is not considered in Figure 3.

Figure 4: The behavior of the asynchronous register.
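
As a rough illustration of the capture/pass rule, the register can be modeled as transparent while “c” equals “p” and holding otherwise. The class and method names below are hypothetical, delays are ignored, and the check that events alternate in the order c, p, c, p is an added safeguard rather than something the circuit itself enforces.

```python
class AsyncRegister:
    """Hypothetical behavioral model of the capture/pass register."""

    def __init__(self, di=None):
        self.c = 0       # capture control, initialized to 0
        self.p = 0       # pass control, initialized to 0
        self.di = di
        self.do = di     # c == p initially, so the register starts transparent

    def _update(self):
        if self.c == self.p:
            self.do = self.di  # transparent: Di is passed to Do

    def set_input(self, value):
        self.di = value
        self._update()

    def capture(self):
        """Event on c: Do holds the last input value seen before this event."""
        if self.c != self.p:
            raise ValueError("capture event while the register is already holding")
        self.c ^= 1

    def pass_(self):
        """Event on p: the register returns to passing mode."""
        if self.c == self.p:
            raise ValueError("pass event while the register is already transparent")
        self.p ^= 1
        self._update()
```

After `capture()`, later calls to `set_input` change only `di`; the new value reaches `do` when `pass_()` is called.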

Computational part The computational part can be implemented by combinational
logic; however, added delay is required to ensure the handshaking protocol. The computational
part can also be implemented by differential cascode voltage switch logic (DCVSL)
without added delay [21]. DCVSL is suitable for four-phase handshaking operation; therefore,
we need to have two-to-four and four-to-two phase change circuits to make this kind
of circuit useful in two-phase design. The bundled data convention and the bounded delay
model are used here primarily to save silicon area and the design time for the computational
parts.

Combining Muller C-elements and asynchronous registers forms the pipeline structure
of asynchronous systems, i.e., micropipelines [28]. We take a single stage from Sutherland’s
micropipelines as a basic functional block in our system, and it is shown in Figure 5(a). “Ri”,
“Ai”, and “Di” correspond to the input request signal, input acknowledge signal, and input data
signals; “Ro”, “Ao”, and “Do” correspond to the output request signal, output acknowledge
signal, and output data signals. The input/output behavior of the basic block is shown in
Figure 5(b). Operationally, “Ri”, “Ai”, “Ro” and “Ao” are initialized to 0. Notice that there
is an inversion at one of the inputs of the Muller C-element; it means that there is an initial
event at that input. Therefore, event “Ri” will generate an event at the output of the Muller
C-element, which will capture the input data “Di”. After passing through the register, this
event will pass to “Ai” to notify the input data transfer that “Di” is stored, i.e., its value
can be changed. In other words, one cycle of input data transfer is complete. The captured
input is presented at the output of the register, which is then operated on by the computational
part. The added delay is to ensure that the valid data is produced before the arrival of event
“Ro”. Therefore, the added delay equals the critical path delay of the computational part
plus some safety margin. Event “Ao” will complete one cycle of output data transfer,
and it allows the register to pass new input data to the functional block. After “p” of the
register receives an event from “Ao”, the Muller C-element receives an (initial) event to allow
another cycle of input data capture. In case the new “Di” and “Ri” have arrived before the
transfer of output data is completed, the Muller C-element will wait until the other input of
the C-element receives an (initial) event from “Ao” through “p” of the register.

Footnote: Circuit delay is not considered in Figure 4.

Figure 5: The basic block of micropipelines.
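
The control behavior of one stage, as described above, might be sketched as follows. This is a simplified, zero-delay model under stated assumptions: the class `Stage`, its attribute names, and the callable `logic` are hypothetical, the added delay is abstracted away by computing “Do” before toggling “Ro”, and the register's internal capture-done and pass-done signals are collapsed into single assignments.

```python
class Stage:
    """Hypothetical zero-delay model of one two-phase micropipeline stage."""

    def __init__(self, logic):
        self.logic = logic               # the computational part, e.g. an adder
        self.ri = self.ai = self.ro = self.ao = 0
        self.c = 0                       # C-element output, drives the register's capture
        self.pd = 0                      # pass-done, fed back inverted to the C-element
        self.di = None
        self.do = None

    def _c_element(self):
        a, b = self.ri, 1 - self.pd      # the inversion supplies the initial event
        if a == b and self.c != a:
            self.c = a                   # event at the C-element output
            held = self.di               # the register captures Di
            self.ai = self.c             # Ai: the input may now change
            self.do = self.logic(held)   # result is valid before Ro (added delay)
            self.ro = self.c             # Ro: output data is available

    def request(self, di):
        """Environment places data on Di and toggles Ri."""
        self.di = di
        self.ri ^= 1
        self._c_element()

    def acknowledge(self):
        """Environment toggles Ao; the register passes and re-arms the C-element."""
        self.ao ^= 1
        self.pd = self.ao
        self._c_element()
```

If `request` is called a second time before `acknowledge`, the C-element inputs disagree and nothing happens until the acknowledge arrives, matching the waiting behavior described in the text.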

3.3  Data  Flow  Model  for  Basic  Blocks 

The main reason to use a data flow specification in our system is that the behavior of basic
blocks of asynchronous systems can be described by a data flow model. We view each basic
block as a functional unit. Data is available at the input of the block and is captured by the
block; data is produced at the output of the block, which becomes the input data of another
functional block. The behavior of the basic block is analogous to the behavior of a functional
node in a data flow graph, which absorbs input tokens and generates output tokens. This
analogy is shown in Figure 6.

Figure 6: Data flow model for basic blocks.

Figure 6(a) shows a sequence of states to describe the same input/output behavior of the
basic asynchronous block that Figure 5(b) describes, where labels 1, 2, and 3 represent the
sequence of events: data available, the request signal transition, and the acknowledge signal
transition of the two-phase handshaking protocol. Each state in Figure 6(a) represents that
the block has just received or produced an event, which is denoted by a numbered circle, and
is waiting for the next event. The two states which are grouped together at the left of Figure 6(a)
correspond to an idle functional node in the data flow model. Although there is input data
available at the last state of this group, the block has not been notified of the availability
of input data by the external world. Only when event “Ri” is activated by the external
world does the basic block know that there is data available at “Di”. Therefore, the top center
state in Figure 6(a) corresponds to a data flow function node with an input token. After the
block captures the input data and completes its calculation, the block produces output
data, and the external world is notified by event “Ro”. Therefore, the state at the bottom center
of Figure 6(a) corresponds to the output token generated in the data flow model. After the
external world releases the output data by activating event “Ao”, the block becomes idle
again. Two transient states in Figure 6(a) are not mapped into the data flow model, and
they represent the functional operation in the real circuit, which takes time. However, they
can be ignored in the high-level data flow model and are represented by a proper timing
model for system analysis.

Footnote: The external world means the surrounding environment of the basic block.

Handshaking  protocol  and  token  model  The  key  of  this  analogy  is  to  model  the  data 
transfer, which is based on the two-phase handshaking protocol, by the token movement in the
data flow model. Referring to Figures 2(a) and 2(b), the data transfer between event 2
(Request) and event 3 (Acknowledge) is the state in which the data is available on the data bus
and is waiting to be captured by the receiver. This state is mapped into an appearance
of  a  token  between  two  function  nodes  in  the  corresponding  data  flow  graph;  the  token 
is  generated  by  the  function  node  which  corresponds  to  the  sender  block,  and  it  will  be 
absorbed  by  the  function  node  which  corresponds  to  the  receiver  block.  Later,  we  will  derive 
the  extended  data  flow  graph  based  on  the  same  token  model. 
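
Under the two-phase protocol this mapping has a compact form. Assuming both wires start at the same level (as stated earlier, they are initialized to 0), a token is on the arc exactly when the request and acknowledge levels differ. The function name below is hypothetical.

```python
def token_on_arc(req, ack):
    """Return True when a token is present between sender and receiver.

    Assumes two-phase signaling with req and ack initialized to the same level:
    the sender's request transition makes the levels differ (token generated),
    and the receiver's acknowledge transition makes them equal again (token absorbed).
    """
    return req != ack
```

For example, the level pairs (req, ack) over two consecutive transfers are (0, 0), (1, 0), (1, 1), (0, 1), (0, 0), which read as: no token, token, no token, token, no token.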

4  Data  Flow  Specification 

The  data  flow  graph  (DFG)  is  used  as  the  input  specification,  and  it  is  based  on  the  token 
model  used  in  data  flow  computing  [8].  In  this  section,  we  briefly  describe  the  structure  and 
semantics  of  the  DFG  used  in  our  synthesis  system. 

A  DFG  is  a  directed  graph  with  typed  nodes  and  port-specific  arcs,  where  port  refers 
to  the  input/output  terminals  of  a  node.  Each  node  in  a  DFG  belongs  to  a  finite  set  of 
node  types  which  represent  the  basic  constructs  of  the  DFG  specification.  Each  directed 
arc  in  a  DFG  connects  a  specific  output  port  of  a  node  to  a  specific  input  port  of  a  node. 
The  semantics  of  a  DFG  are  expressed  by  the  movement  of  tokens.  A  token  represents  the 
presence  of  data  on  the  corresponding  input.  A  node  is  activated  when  all  its  necessary 
input  arcs  have  tokens.  An  activated  node  computes  or  fires  by  absorbing  all  the  tokens  on 
its  inputs  and  placing  tokens  on  its  outputs.  There  is  no  notion  of  synchronization  among 
activated  nodes,  as  these  nodes  operate  asynchronously  and  concurrently  [8]. 

4.1  Basic  Constructs 

By  considering  area/performance  efficiency  of  asynchronous  block  implementation,  we  have 
generalized  and  enriched  the  basic  DFG  constructs,  which  are  shown  in  Figure  7,  from  the 
conventional  data  flow  specification.  For  example,  the  conventional  data  flow  specification 
often  uses  binary  input  (or  output)  control  constructs  such  as  the  Distributor  and  the  Selec¬ 
tor. To distribute one data item to one of eight destinations, we would need to use three levels of
these two-input Distributors. In terms of delay and area consumption, we found it is more
efficient to implement a single block to handle one-to-eight distribution than to use three
levels of the two-input Distributors. To distinguish from the conventional constructs in data
flow specification, we prefix the names of these multiple input/output control constructs with
“M”. These enhancements imply that the set of basic constructs may grow in the future as
long as the new constructs satisfy the data flow model and they are needed in the description
of asynchronous systems.

Figure 7: Basic constructs of the DFG.

The behavior of the basic constructs shown in Figure 7 is given
below.

•  MJoin:  If  there  is  a  token  at  each  input,  MJoin  absorbs  all  the  input  tokens  and 
generates  an  output  token.  The  output  token  represents  all  data  values  from  all  input 
tokens. 

•  MFork:  If  there  is  an  input  token,  MFork  absorbs  the  input  token  and  generates  a 
token on each of its outputs with the same data value as on the input.

•  MDistributor:  If  there  is  a  token  at  the  data  input  port  and  a  token  at  the  condition 
input  port  carrying  the  value  m,  MDistributor  absorbs  both  input  tokens  and  generates 
a  token  at  output  port  m  with  the  same  value  as  on  the  data  input  port. 

•  MSelector:  If  there  is  a  token  at  input  port  m  and  a  token  at  the  condition  input  port 
carrying  the  value  m,  MSelector  absorbs  both  input  tokens  and  generates  an  output 
token  with  the  same  data  value  as  on  input  port  m. 

• Pass(<cond>): If there is a token at the data input port and a token at the condition
input port, Pass(<cond>) absorbs both input tokens, and generates an output token
with the same data value as on the data input if the condition data equals <cond>.

•  Arbiter:  If  there  exist  token(s)  at  the  input  port(s)  and  there  is  no  token  at  the 
output  port,  one  and  only  one  input  token  is  absorbed  and  passed  to  the  output  port. 

•  Atomic  functions:  These  represent  computational  nodes,  e.g.,  adders,  multipliers, 
and  so  on. 

• Macro function: A macro function represents a function defined by another data
flow graph, and it supports hierarchical description.

Footnote: The minimum set of basic constructs is not very meaningful for a hardware description language.


There  are  three  rules  regarding  the  data  flow  specification  and  its  behavior  model: 

1.  At  most  one  token  is  allowed  on  an  arc  at  any  time. 

2.  Every  basic  construct  can  absorb  tokens  from  its  input  port(s)  only  if  no  token  is 
present  at  any  of  its  output  arcs.  In  other  words,  no  tokens  are  allowed  to  accumulate 
in  any  of  the  basic  constructs. 

3.  No  recursive  (macro)  function  is  allowed  in  our  system. 
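
Rules 1 and 2 can be illustrated with a small token-passing sketch. It is not the authors' implementation: the `Arc` class and the `fire_*` functions are hypothetical names, and only MJoin, MFork, and MDistributor are shown. A construct fires only when its required inputs hold tokens and all of its output arcs are empty.

```python
_EMPTY = object()  # sentinel, so that any value (including None or ()) can be a token


class Arc:
    """Hypothetical arc holding at most one token (rule 1)."""

    def __init__(self):
        self._token = _EMPTY

    @property
    def full(self):
        return self._token is not _EMPTY

    def put(self, value):
        if self.full:
            raise RuntimeError("rule 1 violated: arc already holds a token")
        self._token = value

    def take(self):
        if not self.full:
            raise RuntimeError("no token to absorb")
        value, self._token = self._token, _EMPTY
        return value


def _may_fire(inputs, outputs):
    # Rule 2: absorb inputs only if no output arc still holds a token.
    return all(a.full for a in inputs) and not any(a.full for a in outputs)


def fire_mjoin(inputs, output):
    if not _may_fire(inputs, [output]):
        return False
    output.put(tuple(a.take() for a in inputs))  # one token carrying all input values
    return True


def fire_mfork(inp, outputs):
    if not _may_fire([inp], outputs):
        return False
    value = inp.take()
    for out in outputs:
        out.put(value)                           # same value on every output
    return True


def fire_mdistributor(data, cond, outputs):
    if not _may_fire([data, cond], outputs):
        return False
    m = cond.take()
    outputs[m].put(data.take())                  # token appears on output port m only
    return True
```

With eight output arcs, `fire_mdistributor` performs the one-to-eight distribution in a single firing, which corresponds to the single MDistributor block discussed above.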

4.2  Data  Types 

Since  the  goal  is  to  transform  or  translate  DFG  descriptions  into  hardware  realizations,  each 
data  item  has  a  fixed  format  as  specified  by  the  data  type.  For  example,  the  input  and  the 
output  of  a  16-bit  adder  have  certain  data  formats,  e.g.,  the  input  contains  two  16-bit  data, 
and the output contains a 1-bit carry-out and a 16-bit sum. There are three basic data
types: 

1.  A  null  data  type  is  denoted  by  null. 

2.  A  set  of  n-bit  wire  data  types  is  denoted  by  nb,  where  n  is  a  positive  integer. 

3. Group data types are denoted by $(g_1, g_2, \ldots, g_m)$, where each $g_i$ is a null data type,
a wire data type, or another group data type.

The  following  items  in  a  DFG  are  data  typed:  input/output  ports,  directed  arcs,  and  tokens. 
If  an  output  port  is  connected  to  an  input  port  through  a  directed  arc,  the  output  port,  the 
input  port,  the  directed  arc,  and  tokens  which  flow  through  the  arc  should  have  the  same 
data type. The data type of an input/output port depends on what kind of function this
node is. For example, a 16-bit adder may have input data type (16b, 16b) and output data
type (1b, 16b).

Except for atomic functions, macro functions, and MJoin, the input ports (excluding the
conditional input) and the output ports of any construct have the same data type. MJoin absorbs
tokens from all inputs and generates a token representing all input tokens. If the $n$ inputs of
MJoin from left to right have data types $t_1, t_2, \ldots, t_n$, then the data type of the output is
$(t_1, t_2, \ldots, t_n)$. During the manipulation of a DFG, a data type may be reduced to an
equivalent data type: $(g_1)$ is equivalent to $g_1$, and $(g_1, g_2, \ldots, g_{j-1}, g_j, g_{j+1}, \ldots, g_m)$ is equivalent
to $(g_1, g_2, \ldots, g_{j-1}, g_{j+1}, \ldots, g_m)$ if $g_j$ is a null data type. A data type is primitive if it is
neither a group data type containing null data type(s) nor a group data type containing only
one data type. Every data type is equivalent to a unique primitive data type. Hence only
primitive data types are considered during the manipulation of DFG constructs, e.g., MJoin
shown in Figure 8.

Figure 8: Example: manipulation of data types.
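
The reduction to a primitive data type can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoding (None for null, a positive integer n for nb, a tuple for a group) and the names `primitive` and `mjoin_output` are hypothetical and do not come from the report. The sketch also assumes that nested groups are reduced member by member and that a group left with no members is treated as null.

```python
from typing import Tuple, Union

# Hypothetical encoding (not from the report): None is the null type,
# a positive int n is the wire type nb, and a tuple is a group type.
DataType = Union[None, int, Tuple["DataType", ...]]


def primitive(t: DataType) -> DataType:
    """Reduce a data type to an equivalent primitive data type."""
    if t is None or isinstance(t, int):
        return t
    # Reduce each member first, then drop the members that are null.
    members = [m for m in (primitive(g) for g in t) if m is not None]
    if not members:
        return None        # assumption: a group with only nulls is null
    if len(members) == 1:
        return members[0]  # (g1) is equivalent to g1
    return tuple(members)


def mjoin_output(*input_types: DataType) -> DataType:
    """Output type of an MJoin whose inputs, left to right, have these types."""
    return primitive(tuple(input_types))
```

For instance, `mjoin_output(16, 16)` keeps the group (16b, 16b), while `mjoin_output(None, 8)` reduces to the wire type 8b because the null member is dropped and the remaining single-member group collapses.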


There is no distinction between a data signal and a control signal in our system. A
control signal in the conventional data flow graph is a one-bit data signal which carries
either true (logic 1) or false (logic 0), i.e., its data type is 1b. Furthermore, we generalize
control signals to be nb, which can control multiple ($2^n$) input/output choices.

In a DFG, each token carries a data value. The data value of a type null token is (). A
valid value of a type nb token is $(x_1, \ldots, x_n)$, where each $x_i$ for $1 \le i \le n$ is any valid value
on a wire, e.g., 0, 1, and X. A valid data value of a token of group data type $(g_1, g_2, \ldots, g_m)$
is $(v_1, v_2, \ldots, v_m)$, where $v_i$ is a valid value of data type $g_i$ for $1 \le i \le m$. For example,
the valid values of an 8b token are the tuples $(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)$ with each $x_i \in \{0, 1, X\}$ for
$1 \le i \le 8$. Similarly, the valid values in (4b, 4b) are the tuples $((x_1, x_2, x_3, x_4), (y_1, y_2, y_3, y_4))$ with
$x_i, y_i \in \{0, 1, X\}$ for $1 \le i \le 4$. Both of the above two data types are implemented as eight
wires in hardware, but they may represent different meanings in the DFG. For example, the
former may be an 8-bit integer, and the latter may be two 4-bit integers.

4.3  The Need for an Extended Data Flow Graph

In  Section  3.3  we  use  a  data  flow  graph  to  model  basic  blocks  in  micropipelines.  Conversely, 
we  can  translate  a  DFG  description  into  an  asynchronous  system,  which  is  composed  of  basic 
blocks.  However,  each  input  port  of  every  block  needs  a  register  to  latch  data,  so  this  kind 
of implementation may result in many registers. For example, the two-input addition DFG
description  in  Figure  9  can  be  directly  translated  into  the  implementation  by  mapping  nodes 
MJoin  and  ADD  into  blocks.  In  this  example,  there  are  two  levels  of  registers.  In  terms  of 
area and performance efficiency, we do not need both levels of registers. Therefore, removing
the  input  register  of  the  ADD  block  yields  an  implementation  with  better  performance  and 
smaller  area. 

4.3.1  Register  Blocks  and  Computational  Blocks 

In order to reduce the cost of registers, we separate registers from the basic blocks of
micropipelines. Two basic blocks are defined: the register block and the computational block,
shown in Figure 10(a). Their input/output behaviors are shown in Figure 11, where the
variables $D_{sfl}$, $D_{sbl}$, $D_{sp}$, $D_{fp}$, and $D_{bp}$ are delays between events which will be defined later.


Figure 9: Two-input addition.

In terms of input/output events, the behavior of the register block is exactly the same
as the basic block of micropipelines. On the other hand, the input events and the output
events  of  a  computational  block  are  closely  related.  Because  there  is  no  storage  in  the 
computational  block,  the  input  data  cannot  be  released  until  the  output  data  is  released. 
Therefore, a complete output event cycle of the two-phase handshaking protocol occurs
between event “Ri” and event “Ai” of the input event cycle. In other words, event “Di” is
followed by event “Ri”, which is followed by event “Do”, which is followed by event “Ro”,
which is followed by event “Ao”, which is followed by event “Ai”. Figure 12(a) shows the
sequence of input/output events.

4.3.2  Extended  Data  Flow  Model  for  Computational  Blocks 

Since  the  behavior  of  the  functional  block  without  storage  differs  from  the  original  functional 
block  of  micropipelines,  a  phantom  node  is  used  to  represent  the  corresponding  data  flow 
construct  for  the  computational  block  in  Figure  10(b).  Using  the  same  handshaking  protocol- 
token  analogy  in  Section  3.3,  we  define  the  extended  data  flow  model  for  computational  blocks 
as  shown  in  Figure  12.  Each  state  in  Figure  12(a)  represents  that  the  block  has  just  received 
or  produced  an  event  and  is  waiting  for  the  next  event.  The  left  two  states  in  Figure  12(a) 
correspond  to  an  idle  phantom  functional  node  in  the  extended  data  flow  model.  When  event 
“Ri”  occurs,  the  computational  block  is  notified  by  the  external  world  that  an  input  data  is 
available  at  “Di”.  Because  there  is  no  memory  in  the  computational  block,  the  value  of  “Di” 
has to be kept valid by the external world until the corresponding output data is released,
i.e.,  event  “Ao”  occurs.  Therefore,  the  right  three  states  in  Figure  12(a)  correspond  to  a 
phantom  functional  node  with  an  input  token.  When  valid  data  is  produced  at  “Do”,  the 

Figure 10: The structure of the basic blocks.

Figure 11: The behavior of the basic blocks. (a) The timing diagram of register block. (b) The timing diagram of computational block.

Figure 12: Extended data flow model for computational blocks. (a) The behavior of a computational block. (b) Extended data flow model.



external  world  is  notified  by  event  “Ro”.  Therefore,  the  bottom  right  state  in  Figure  12(a) 
corresponds  to  a  phantom  functional  node  with  an  output  token.  After  the  external  world 
releases the output data by activating event “Ao”, the computational block activates event
“Ai”,  and  the  block  becomes  idle  again.  Figure  12(b)  is  the  corresponding  extended  data 
flow  model  for  the  computational  block.  Unlike  the  conventional  data  flow  model,  an  input 
token  is  not  absorbed  when  the  corresponding  output  token  is  produced;  the  input  token 
is  removed  when  the  corresponding  output  token  is  absorbed  by  the  external  world.  The 
output  token  looks  like  an  extension  of  the  input  token  through  the  phantom  node,  so  the 
output  token  is  called  an  extended  token  (of  the  input  token).  One  transient  state  in  Figure 
12(a)  is  not  mapped  into  the  extended  data  flow  model,  and  the  computational  block  is  reset 
in this state. In the implementation shown in Figure 10(a), “Ai” is directly connected to
“Ao”, so this transient state takes zero delay (assuming that wiring delay can be ignored).
However, the block may remain in this reset state for some time if the computational block
is implemented in DCVSL. As stated previously, the transient state can be ignored in the
high-level data flow model and is taken care of with a proper timing model for system analysis.

Figure 13: Basic constructs of EDFG.
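
The event order of a computational block and its token interpretation can be summarized in a small sketch. It assumes one particular reading of Figure 12(a), which is not reproduced here: the two idle states are taken to be the initial state and the state after “Di”, the three input-token states are those after “Ri”, “Do”, and “Ro”, and the transient reset state is the one after “Ao”. The names `CYCLE`, `VIEW`, and `replay` are hypothetical.

```python
from typing import Iterable

# Event order of one cycle of a computational block (Section 4.3.1).
CYCLE = ("Di", "Ri", "Do", "Ro", "Ao", "Ai")

# Assumed token view of the state reached after k events of the cycle.
VIEW = {
    0: "idle",
    1: "idle",
    2: "input token",
    3: "input token",
    4: "input token + extended output token",
    5: "transient reset",
}


def replay(events: Iterable[str]) -> str:
    """Replay a sequence of events and return the resulting token view."""
    k = 0
    for event in events:
        expected = CYCLE[k]
        if event != expected:
            raise ValueError(f"event {event!r} out of order, expected {expected!r}")
        k = (k + 1) % len(CYCLE)
    return VIEW[k]
```

Replaying a full cycle returns the block to the idle view, and any event that arrives out of order, such as “Ro” before “Ri”, is rejected.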

4.4  Extended  Data  Flow  Graph 

After  separating  the  register  from  the  micropipeline  structure,  we  now  describe  a  simple  but 
very  useful  extension  to  the  DFG  to  describe  these  new  basic  blocks.  The  result  is  called 
an  extended  data  flow  graph  (EDFG),  which  provides  a  bridge  between  the  abstract  DFG 
specification and the circuit implementation. The set of basic constructs in EDFG is shown
in Figure 13.

Syntactically  an  EDFG  is  the  same  as  a  DFG.  However,  the  semantics  of  an  EDFG  are 
defined  so  as  to  comply  with  the  hardware  behavior  of  asynchronous  circuits.  The  essential 
difference is in the rules that govern the movement of tokens. Based on the extended data
flow  model  described  in  Section  4.3.2,  the  behavior  of  EDFG  is  described  as  follows.  There 
are  two  kinds  of  tokens,  namely,  regular  tokens  and  extended  tokens.  Both  kinds  of  tokens 
represent  where  data  is  available.  A  regular  token  represents  the  data  that  is  the  direct 
output  token  of  a  register,  and  it  is  denoted  by  a  dark  circle.  An  extended  token  represents 
the  data  that  is  the  output  token  of  a  non-register  node,  and  it  is  denoted  by  a  circle. 
The behavior of Storage, the only non-phantom construct in the EDFG, is the same as the
behavior  model  described  in  the  DFG;  a  Storage  absorbs  input  tokens,  which  are  either 
regular  or  extended,  and  generates  regular  output  tokens  with  the  same  values  as  the  input 
tokens.  The  behavior  of  a  phantom  construct  in  the  EDFG  is  similar  to  the  behavior  of  the 
corresponding  non-phantom  construct  in  the  DFG  except  for  the  following  differences: 

•  Input tokens to a phantom node can be regular or extended.

•  A phantom node generates only extended tokens.

•  When a phantom node generates output tokens, it does not absorb its input tokens.

•  A token on the input of a phantom node can be absorbed only if all its extended tokens
are absorbed.

Figure 15: A MFork equals a phantom MFork with an input Storage.

Figure 14 shows the basic differences in the movement of tokens in an EDFG and the
corresponding DFG. After the output tokens are generated, the input arc of an MFork in the DFG
can receive a new data token, but the input arc of a phantom MFork in the EDFG cannot
receive a new data token until all its output tokens are absorbed.
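
The difference in token movement can be made concrete with a minimal sketch of an MFork in both semantics. The class `MFork` and its methods are hypothetical names introduced for illustration; the sketch models a single node only and uses None to mean "no token", so None cannot be used as a data value here.

```python
class MFork:
    """One-input MFork under DFG semantics or phantom (EDFG) semantics."""

    def __init__(self, n_outputs: int, phantom: bool) -> None:
        self.phantom = phantom
        self.input = None                   # token on the input arc
        self.outputs = [None] * n_outputs   # tokens on the output arcs
        self._extended = False  # True while outputs extend the held input token

    def can_accept(self) -> bool:
        """At most one token is allowed on the input arc."""
        return self.input is None

    def put(self, value) -> None:
        if not self.can_accept():
            raise RuntimeError("input arc already holds a token")
        self.input = value

    def fire(self) -> bool:
        """Generate output tokens if an input is present and all outputs are empty."""
        ready = (self.input is not None and not self._extended
                 and all(o is None for o in self.outputs))
        if not ready:
            return False
        self.outputs = [self.input] * len(self.outputs)
        if self.phantom:
            self._extended = True   # input token stays on the arc
        else:
            self.input = None       # DFG node absorbs its input token
        return True

    def take(self, k: int):
        """The environment absorbs the token on output arc k."""
        value, self.outputs[k] = self.outputs[k], None
        if self._extended and all(o is None for o in self.outputs):
            self.input = None       # all extended tokens absorbed
            self._extended = False
        return value
```

After `fire()`, a DFG MFork reports `can_accept()` as true immediately, whereas a phantom MFork does so only once `take()` has been called on every output arc.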

Extended tokens and regular tokens. In terms of the availability of data and the
analogous meaning to the handshaking protocol, there is no difference between the extended
token and the regular token. The purpose of defining extended tokens is to emphasize the
semantic difference between phantom nodes and non-phantom nodes, which are used in the
conventional data flow graph [8], and also to emphasize the relation among input data
and output data of phantom nodes.


DFG vs. EDFG. A DFG node is equivalent to its phantom counterpart with an input
storage at each input port, e.g., the MFork/phantom MFork in Figure 15. Since a DFG
description can be replaced by an EDFG description, why do we need the DFG? The first
reason  is  that  the  DFG  lets  designers  focus  on  the  functional  description  without  worrying 
about  the  hardware  implementation,  e.g.,  how  many  adders,  where  to  assign  registers,  and 
so  on.  The  DFG  is  also  a  well-known  language/concept  for  data  flow  computing,  so  it  can 
be  easily  adopted  by  designers.  Another  reason  is  for  the  convenience  of  system  synthesis. 
DFG  and  EDFG  are  used  in  the  different  stages  of  asynchronous  system  synthesis  in  our 
system.  The  DFG  is  mainly  used  in  the  early  steps  of  system  synthesis  such  as  sequencing 
and  allocation,  mapping  of  sharing  schemes,  and  local  transformation,  where  a  register  is 
assumed  for  every  data  transfer.  The  EDFG  is  mainly  used  in  the  synthesis  steps  such  as 
register  minimization,  deadlock  prevention,  and  local  transformation  before  the  specifications 
are  mapped  to  hardware  modules.  Notice  that  partitioning  the  synthesis  procedure  into  steps 
is  not  unique  and  that  the  tasks  of  the  synthesis  steps  are  usually  closely  related  [11,  18]. 

4.5  Hardware Translation - Syntax-Directed Method

We  adopt  the  syntax-directed  method  [5,  6]  to  realize  the  physical  design  from  the  extended 
data  flow  graph  (EDFG)  specification.  In  this  method,  each  basic  construct  in  the  high- 
level  specification  is  directly  translated  into  a  corresponding  hardware  module.  Therefore, 
the  data  flow  graph  not  only  describes  the  behavior  of  a  system  but  also  represents  the 
structure  of  the  system.  By  using  this  method,  the  correctness  of  a  hardware  implementation 
is  proven  by  construction.  Therefore,  the  design  method  mainly  focuses  on  mapping  the 
constructs  and  the  behavior  models  of  the  EDFG  description  into  the  functional/control 
blocks  of  the  micropipeline  structure,  and  on  reflecting  the  hardware  characteristics  of  the 
functional/control  blocks  to  the  parameters  of  the  DFG  constructs. 

Translating EDFG constructs into asynchronous components. A path in an EDFG
is  mapped  to  a  two-phase  handshaking  data  transfer  bus,  including  a  data  bus  (wires)  with 
the  same  data  type,  a  request  line,  and  an  acknowledge  line.  A  token,  either  regular  or 
extended,  on  a  data  flow  path  corresponds  to  the  state  in  which  the  data  and  the  request 
have  been  sent  but  the  acknowledge  has  not  been  received  by  the  sender.  This  correspondence 
allows us to design a component corresponding to each construct in an EDFG, ensuring that
the interface requirement is satisfied. Thus there is a one-to-one correspondence between
the elements of an EDFG and the hardware modules (see Figure 16). We have developed all
the  mappings  in  our  cell  library  [32],  e.g.,  Figure  17  shows  a  design  for  the  4-output  MFork. 

Hardware properties in EDFG. Since each basic construct of EDFG is directly mapped
into an asynchronous module, the hardware properties of this asynchronous module are
attached to its corresponding node in an EDFG description: (1) $D_{sfl}(n_i)$ is the forward latch
time of register node $n_i$; (2) $D_{sbl}(n_i)$ is the backward latch time of register node $n_i$; (3) $D_{sp}(n_i)$
is the propagation delay time of register node $n_i$; (4) $D_{fp}(n_i)$ is the forward propagation delay
time of phantom node $n_i$; (5) $D_{bp}(n_i)$ is the backward propagation delay time of phantom
node $n_i$; (6) $S(n_i)$ is the area cost of node $n_i$; and (7) others. These properties can be used for
system analysis in a high-level data flow description. Furthermore, designers and synthesis
algorithms can make design decisions in a high-level data flow description based on these
attached hardware properties.

Figure 17: Block design for 4-output MFork.


5  Timing  Model  for  Data  Flow  Specification 

A high-level specification is useful not only to describe the functional behavior of systems,
but also to analyze and predict the resulting implementation. In this section we first use timed
Petri nets [25] to model the timing behavior of basic blocks. Then we show that the
composition of these timed Petri net models can be used to express the timing behavior
of asynchronous systems which are composed of basic blocks. Based on the timing models
derived from timed Petri nets, the timing parameters and the timing behaviors of both DFG
and EDFG are defined.

5.1  Timing  Behavior  Model  for  Basic  Blocks 

There  are  three  kinds  of  basic  blocks,  referred  to  as  the  register  block,  the  computational 
block  and  the  control  block.  The  circuit  structures  of  register  blocks  and  computational 
blocks  have  been  shown  earlier.  Control  blocks  are  those  basic  blocks  corresponding  to 
phantom  constructs  in  EDFG  which  have  more  than  one  input  and/or  more  than  one  output, 
e.g.,  asynchronous  blocks  of  MSelector.  If  we  consider  the  joining  of  multiple  events  in 
control  blocks  as  “an  event”,  the  behavior  of  a  control  block  is  the  same  as  the  behavior  of 
a  computational  block.  For  example,  a  two-input  MJoin  can  be  activated  only  if  both  input 
request  events  arrive.  The  event  of  “both  input  request  events  occur”  is  equivalent  to  the 
input  request  event  of  a  computational  block.  Therefore,  we  only  need  to  model  the  register 
block and the computational (non-register) block.

Since the data transfer between blocks follows the two-phase handshaking protocol, the
data value is always valid from event request to event acknowledge. Therefore, we only need
to model the events of control signals. There are four events associated with the register and
computational block:

Ri:  input  data  ready  -  this  corresponds  to  the  input  “request”  signal  transition. 

Ai:  input  data  done  -  this  corresponds  to  the  input  “acknowledge”  signal  transition. 

Ro:  output data ready - this corresponds to the output “request” signal transition.

Ao:  output  data  done  -  this  corresponds  to  the  output  “acknowledge”  signal  transition. 

The  timing  behavior  of  basic  blocks  is  most  easily  described  using  timed  Petri  nets  [25], 
where  transitions  represent  input/output  (control)  events  of  the  block,  and  places  represent 
the  conditions  of  events  in  the  block.  The  delay  from  a  place  (state)  to  a  transition  (event), 
which  is  labeled  on  the  arc  between  them,  represents  the  minimum  time  interval  from  when 
the  condition  is  satisfied  to  when  the  transition  is  activated.  The  state  of  the  block  is 
represented  by  the  distribution  of  tokens  in  the  timed  Petri  net.  Figure  18  shows  the  timed 
Petri  net  models  for  the  register  block  and  the  computational  block,  where  the  tokens  of 
each  model  represent  the  initial  state  of  the  corresponding  block.  By  simulating  the  token 
movement  in  each  Petri  net,  we  can  easily  find  the  relation  among  Figures  18(a)  and  6(a) 
and  11(a)  for  the  register  block,  and  the  relation  among  Figures  18(b)  and  12(a)  and  11(b) 
for  the  computational  block. 

Timing parameters. There is a delay associated with each pair of sequential events. These
delays are shown in Figures 11 and 18. The timing parameters for a register block are defined
as follows:

$D_{sfl}$ : forward latch time is the time for the register to latch the input data when the register
is ready and the new data just arrives. This corresponds to the delay from Ri to Ai in
the Petri net, where the post-Ri condition represents that input data is ready.

$D_{sbl}$ : backward latch time is the time for the register to latch the input data when the input
data is ready and the register just becomes available. This corresponds to the delay
from Ao to Ai in the Petri net, where the post-Ao condition represents that the register
is ready.

$D_{sp}$ : propagation delay time is the time from when the input data is latched to when the
output data becomes valid. This corresponds to the delay from Ai to Ro in the Petri
net.

Figure 18: Timing model for asynchronous blocks.


The  timing  parameters  for  a  non-register  block  are  defined  as  follows: 

$D_{fp}$ : forward propagation delay is the time from when all required input data are valid to
when all corresponding output data are valid. This corresponds to the delay from Ri
to Ro in the Petri net.

$D_{bp}$ : backward propagation delay is the time from when the output data is acknowledged
to when the input data is acknowledged. This corresponds to the delay
from Ao to Ai in the Petri net.

In addition to the basic delay parameters associated with each block, there are two delays
associated with the environment. These are denoted by $\delta_1$ and $\delta_2$. $\delta_1$ is defined as the delay
from Ai to Ri. This is the time from the input acknowledge event (completion) to the
next input request event (starting). Similarly, $\delta_2$ is defined as the delay from Ro to
Ao. This is the time from the output request (starting) to the next output acknowledge
(completion). Both $\delta_1$ and $\delta_2$ depend on the response from the environment of the block
and are not delays constrained by the hardware implementation of a block, so we bracket
the notations.
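
The parameters above can be collected into small records, one per block kind. The following sketch is illustrative only: the class and field names are hypothetical, and the helper `isolated_cycle_time` simply adds the four delays around the event loop Ri, Ro, Ao, Ai of a single non-register block considered on its own, with the environment delays supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterTiming:
    d_sfl: float  # forward latch time, Ri -> Ai
    d_sbl: float  # backward latch time, Ao -> Ai
    d_sp: float   # propagation delay time, Ai -> Ro


@dataclass(frozen=True)
class NonRegisterTiming:
    d_fp: float   # forward propagation delay, Ri -> Ro
    d_bp: float   # backward propagation delay, Ao -> Ai


def isolated_cycle_time(block: NonRegisterTiming,
                        delta1: float, delta2: float) -> float:
    """Minimum time around the loop Ri -> Ro -> Ao -> Ai -> Ri.

    delta1 (Ai -> Ri) and delta2 (Ro -> Ao) are environment delays.
    """
    return block.d_fp + delta2 + block.d_bp + delta1
```

With made-up values such as `NonRegisterTiming(d_fp=3.0, d_bp=1.0)` and environment delays of 0.5 each, the loop time is the plain sum of the four delays.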


5.2  Timing  Behavior  of  Composed  Blocks 

Two  kinds  of  behavior  models,  the  register  model  and  the  non-register  model,  have  been 
defined  to  describe  the  behavior  of  basic  asynchronous  blocks.  If  we  can  derive  the  composed 
behavior  of  these  two  models  in  timed  Petri  nets,  we  will  be  able  to  derive  the  behavior  for 
any  given  system.  There  are  four  possible  combinations  of  these  two  models  or  these  two 
kinds  of  basic  blocks: 

1.  The output of a register block is connected to the input of a non-register block.

2.  The  output  of  a  register  block  is  connected  to  the  input  of  a  register  block. 

3.  The  output  of  a  non-register  block  is  connected  to  the  input  of  a  non-register  block. 

4.  The  output  of  a  non-register  block  is  connected  to  the  input  of  a  register  block. 

Let B1 and B2 be two basic blocks. Let event E of block B be denoted by E(B), e.g., Ri(B1);
let timing parameter $D_{xx}$ of block B be denoted by $D_{xx}$(B), where xx is one of sfl, sbl, sp,
fp, or bp, e.g., $D_{sfl}$(B1). When B1 and B2 are connected with the outputs of B1 feeding the
inputs of B2, synchronization between these two blocks means that Ro(B1) is Ri(B2)
and Ao(B1) is Ai(B2). The behavior of a composed block can be generated by merging the
timed Petri nets of B1 and B2 with Ro(B1) = Ri(B2) and Ao(B1) = Ai(B2). The timed
Petri nets of the four possible combinations are shown in Figure 19. A formal proof
of the correctness of this composition is not presented in this report. These compositions
will be used to model the delay parameters and the performance analysis in the data flow
specification in the following discussion.

5.3  Performance  analysis  of  linear  pipelines 

Two  measures  are  defined  to  evaluate  the  performance  of  a  system,  the  completion  time  and 
the  throughput  rate.  The  completion  time  is  a  measure  of  how  long  it  takes  to  complete  the 
execution  of  a  set  of  data  from  inputs  to  outputs  of  the  system.  The  throughput  rate  is  a 
measure  of  how  many  sets  of  data  can  be  processed  by  the  system  per  time  unit  in  steady 
state. The inverse of the throughput rate is called the pipeline period. Let the completion
time,  pipeline  period,  and  throughput  rate  be  denoted  by  L,  P,  and  R  respectively.  By 
formulating  the  performance  measures  for  linear  pipelines  at  the  block  level,  we  can  find  a 
proper  timing  model  for  the  high-level  data  flow  specification. 


Figure 19: The behavior of composed blocks (panels include Register -> Non-register and Non-register -> Register).


A linear pipeline is a series of computations divided by registers. Although many systems
are  not  linear  pipelines,  each  input-output  computation  path  in  a  system  can  be  viewed  as  a 
linear  pipeline.  Unlike  synchronous  systems,  the  computation  time  between  two  consecutive 
registers  is  not  fixed.  We  will  analyze  the  performance  for  a  stage,  and  then  expand  the 
analysis  to  the  performance  of  a  pipeline. 


5.3.1  Performance  analysis  for  a  stage 

By observing the timed Petri net descriptions in Figure 18 and the descriptions for the
composed block in Figure 19, we find that the event Ai of a register breaks the input and
the output of the register into two loops, with Ai being the join event of the two loops. We
define the part of the hardware described by a loop of events in a Petri net as a stage. Two
timing parameters are defined for each stage: the forward propagation delay time
and the backward propagation delay time, denoted by $FP_i$ and $BP_i$, respectively,
for stage $i$. The forward propagation delay time of a stage is determined by the timing delay
from Ai of the stage’s input register to Ai of the stage’s output register. The backward
propagation delay time of a stage is determined by the timing delay from Ai of the stage’s
output register to Ai of the stage’s input register. Figure 20(a) is a simple asynchronous
system with computational blocks Comp1 and Comp2 between registers Reg1 and Reg2.
Figure 20(b) is the composed behavior of this system. In this system, the output of Reg1,
Comp1, Comp2, and the input of Reg2 form a stage, and the input of Reg1 and the
output of Reg2 each form a stage as well. The three stages from input to output are labeled stages 0,
1, and 2, whose forward and backward propagation delay times are formulated as follows. (Note
that the input $\delta_1$ (of Reg1) and the output $\delta_2$ (of Reg2) are always assumed to be zero when
we measure the performance of a system. In other words, new data is fed into the system as
soon as the input register is free; output data is removed as soon as it is available.)


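\begin{align*}
FP_0 &= D_{sfl}(\mathrm{Reg1}) \\
BP_0 &= 0 \\
FP_1 &= D_{sp}(\mathrm{Reg1}) + D_{fp}(\mathrm{Comp1}) + D_{fp}(\mathrm{Comp2}) + D_{sfl}(\mathrm{Reg2}) \\
BP_1 &= D_{bp}(\mathrm{Comp2}) + D_{bp}(\mathrm{Comp1}) + D_{sbl}(\mathrm{Reg1}) \\
FP_2 &= D_{sp}(\mathrm{Reg2}) \\
BP_2 &= D_{sbl}(\mathrm{Reg2})
\end{align*}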

By using these parameters, the timing diagram of this system is obtained by simulation, and
it is shown in Figure 21. The measures of L, P, and R can be formulated as follows.

\begin{align}
L &= FP_0 + FP_1 + \max\{(FP_2 + BP_2),\, BP_1\} \tag{1} \\
P &= FP_1 + BP_1 \tag{2} \\
R &= 1/P \tag{3}
\end{align}

There is one further observation from the Petri net description of Figure 20. There is
always one and only one token in each loop of the Petri net. The minimum time for the
token to move around the loop in any stage $i$ is $(FP_i + BP_i)$, so the lower bound of the pipeline
period is $(FP_i + BP_i)$. In other words, the throughput rate of stage $i$ is less than or equal
to $1/(FP_i + BP_i)$. Since two consecutive stages $i$ and $(i+1)$ have a joint event Ai(Reg$i$), the
throughput rate of these two stages, which equals the firing rate of event Ai(Reg$i$), is less
than or equal to both $1/(FP_i + BP_i)$ and $1/(FP_{i+1} + BP_{i+1})$. Therefore, the pipeline period of these two
stages $i$ and $(i+1)$ is greater than or equal to both $(FP_i + BP_i)$ and $(FP_{i+1} + BP_{i+1})$; the lower
bound of the system pipeline period equals the maximum of $(FP_i + BP_i)$ over all stages $i$. A
further observation comes from the timing diagram in Figure 21. The forward execution of data on
a stage is concurrent with the backward execution of data on its previous stage, e.g., $FP_1$
overlaps with $BP_0$ in the timing diagram, and $FP_2$ overlaps with $BP_1$. Therefore,
the completion time is mainly determined by the forward propagation delay times.

Figure 20: A simple asynchronous system. (b) Timed Petri net description.

Figure 21: Completion time and throughput analysis of the simple system.

Figure 22: A linear pipelined system (stages labeled Stage 0, Stage 1, ..., Stage n, Stage (n+1)).

5.3.2  Performance  analysis  of  a  linear  pipeline 

From the previous analysis, a stage is formed by two registers without any register block between
them. Without loss of generality, a linear pipeline is defined as a series of computation blocks
with a register between any two consecutive computation blocks, as shown in Figure 22. In
this figure, registers are labeled Reg0, Reg1, ..., Reg$n$, and the computation block between
Reg$(i-1)$ and Reg$i$ is labeled Comp$i$ for $1 \le i \le n$. There are $(n+2)$ stages in the system,
and the forward and backward propagation delay times of these stages are defined below.

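\begin{align*}
FP_0 &= D_{sfl}(\mathrm{Reg}0) \\
BP_0 &= 0 \\
FP_i &= D_{sp}(\mathrm{Reg}(i-1)) + D_{fp}(\mathrm{Comp}\,i) + D_{sfl}(\mathrm{Reg}\,i), \quad \text{for } 1 \le i \le n \\
BP_i &= D_{bp}(\mathrm{Comp}\,i) + D_{sbl}(\mathrm{Reg}(i-1)), \quad \text{for } 1 \le i \le n \\
FP_{n+1} &= D_{sp}(\mathrm{Reg}\,n) \\
BP_{n+1} &= D_{sbl}(\mathrm{Reg}\,n)
\end{align*}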

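By generalizing the result of the previous example, the performance measures of this system
can be formulated as follows.

\begin{align}
L &= \sum_{i=0}^{n+1} FP_i + BP_{n+1} + \Delta, \notag \\
&\quad \text{where } \Delta = \max\Bigl\{0,\ \max_{1 \le i \le n}\Bigl\{BP_i - \sum_{j=i+1}^{n+1} FP_j - BP_{n+1}\Bigr\}\Bigr\} \tag{4} \\
P &= \max_{0 \le i \le n+1}\{(FP_i + BP_i)\} \tag{5} \\
R &= 1/P \tag{6}
\end{align}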
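
Equations (4) to (6) translate directly into a short routine. The sketch below is a hedged illustration: it assumes the reading of the stage parameters given above, with registers encoded as hypothetical tuples (D_sfl, D_sbl, D_sp) for Reg0 to Regn and computation blocks as tuples (D_fp, D_bp) for Comp1 to Compn. The function name `linear_pipeline` is not from the report.

```python
def linear_pipeline(regs, comps):
    """Return (L, P, R) of a linear pipeline per equations (4)-(6).

    regs:  list of (d_sfl, d_sbl, d_sp) for Reg0..Regn
    comps: list of (d_fp, d_bp) for Comp1..Compn
    """
    n = len(comps)
    if len(regs) != n + 1:
        raise ValueError("need exactly one more register than computation blocks")

    fp = [regs[0][0]]   # FP_0 = D_sfl(Reg0)
    bp = [0]            # BP_0 = 0
    for i in range(1, n + 1):
        d_fp, d_bp = comps[i - 1]
        _, sbl_prev, sp_prev = regs[i - 1]
        sfl_i = regs[i][0]
        fp.append(sp_prev + d_fp + sfl_i)   # FP_i
        bp.append(d_bp + sbl_prev)          # BP_i
    fp.append(regs[n][2])   # FP_(n+1) = D_sp(Regn)
    bp.append(regs[n][1])   # BP_(n+1) = D_sbl(Regn)

    # Delta of equation (4); fp[i + 1:] covers stages i+1 .. n+1.
    delta = max([0] + [bp[i] - sum(fp[i + 1:]) - bp[n + 1]
                       for i in range(1, n + 1)])
    completion = sum(fp) + bp[n + 1] + delta
    period = max(f + b for f, b in zip(fp, bp))
    return completion, period, 1 / period
```

As a sanity check, a single stage whose computation has forward delay 8 and backward delay 2 between two registers with (D_sfl, D_sbl, D_sp) = (2, 2, 1), i.e. `linear_pipeline([(2, 2, 1), (2, 2, 1)], [(8, 2)])`, gives a completion time of 17 and a pipeline period of 15.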

Figure 23: Timing model for storage nodes.


In equation (4), $\Delta$ is most likely to be zero because usually $FP_j > BP_i$. We
conveniently assume $\Delta$ to be zero.
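
As a consistency check, and assuming equation (4) is read as the sum of all forward delays plus $BP_{n+1}$ plus the correction term $\Delta$, setting $n = 1$ recovers equation (1) of the two-register example, because $a + \max\{0, b - a\} = \max\{a, b\}$:

```latex
\begin{align*}
\Delta &= \max\{0,\; BP_1 - FP_2 - BP_2\}, \\
L &= FP_0 + FP_1 + FP_2 + BP_2 + \max\{0,\; BP_1 - FP_2 - BP_2\} \\
  &= FP_0 + FP_1 + \max\{FP_2 + BP_2,\; BP_1\}.
\end{align*}
```

In this case $\Delta$ is nonzero only when the backward delay $BP_1$ exceeds $FP_2 + BP_2$.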


5.4  Timing  Model  for  DFG/EDFG 

After understanding the timing behavior of asynchronous systems at the circuit level, we
define a timing model for the high-level data flow specification which reflects the behavior
of the low-level implementation. The key to modeling timed behavior for a data flow specification
is the interpretation of tokens in the data flow specification with respect to the events in the
block model, i.e., the token-handshaking protocol relation described in Section 3.3. Since an
EDFG is an abstract representation of an asynchronous system and a DFG description can
be replaced by an EDFG description, we begin with the timing model for EDFG.

5.4.1  Timing  Model  for  EDFG 

An EDFG is an abstract representation of an asynchronous system, so the timing parameters D_sfl, D_sbl, D_sp, D_fp, and D_bp are directly attached to the corresponding nodes in the EDFG description. For storage nodes, D_sfl/sbl and (D_sfl/sbl + D_sp) correspond to the input token absorption time and the time to move a token from input to output; whether D_sfl or D_sbl is used in this delay calculation depends on the state of the storage node when the input token arrives. The timed state diagram for a storage node is shown in Figure 23, where each state corresponds to a possible token distribution in Figure 18(a). Two labels are attached to each directed arc between two states in Figure 23: the delay between the two states and the event which occurs between them. For example, S1 in Figure 23 corresponds to the initial token distribution shown in Figure 18(a). After event Ri occurs, S1 transits to S2, and the time interval between S1 and S2 is δ1. S2 in Figure 23 corresponds to the token distribution of the post-Ri condition and the post-Ao condition in Figure 18(a). The unshaded part of Figure 23 represents the arrival of an input token when the register is not holding other data, and D_sfl is the latch time for the token absorption; the shaded part represents the arrival of an input token when the register is holding other data, and D_sbl is the latch time for the token absorption. δ1, δ2, δ3, δ4, δ5, and δ6 are delays associated with the environment, where δ1, δ3, δ4, and δ5 represent the input token arrival time and δ2 and δ6 represent the output token removal time.

Figure 24: Timing model for non-storage nodes.

For phantom nodes, D_fp represents the time from the generation of an input token to the generation of the corresponding output token, and D_bp represents the time from the removal of an output token to the removal of the corresponding input token. To show that multi-input/multi-output phantom nodes share the behavior of single-input single-output phantom nodes, we present a timed state diagram for a two-input MJoin in Figure 24, where each state corresponds to a possible token distribution in Figure 18(b). For example, S1 in Figure 24 corresponds to the initial token distribution shown in Figure 18(b). After both Ri.0 and Ri.1 occur, S1 transits to S2. S2 in Figure 24 corresponds to the token distribution of the post-Ri condition in Figure 18(b).

The timing model described in this section is an enhancement of the (extended) data flow model in Section 3.3 and Section 4.3.2. Comparing the unshaded part of Figure 23 with Figure 6(b) for the storage node, S3 is an extra state in the timed extended data flow model; S3, which lies between S2 and S4, describes the transient states in Figure 6(a). S5, which represents the register resetting state after the removal of the output token, is also an extra state in the timed model, and it is merged into the idle state in Figure 6(a). Comparing Figure 24 with Figure 12(b) for the phantom node, S4 is an extra state in the timed extended data flow model. Again, S4 describes the transient state in Figure 12(a). At this point, the timed extended data flow model fully reflects the low-level behavior in the high-level description.


Figure 25: Timing model for DFG nodes. (States: S1 idle, S2 input arrived, S3 data computing, S4 output produced, S5 resetting; events: Ri, Ai, Ro, Ao.)


5.4.2  Timing  Model  for  DFG 

Each node in a DFG corresponds to an EDFG phantom node plus a storage node at each input of this EDFG node (see Figure 15). Therefore, we can simply use the timing model of the EDFG to simulate a DFG. On the other hand, we can develop a timing model for the DFG by using the timed Petri net model of the composition of the register block and the non-register block in Figure 19. The timed state diagram for a DFG node is shown in Figure 25. Comparing it with Figure 6(b), there are two extra states in Figure 25: S3 corresponds to the transient states for the data latch and the data computation, and S5 corresponds to the transient state for functional unit resetting. Based on the analysis in Section 5.3.2, we further simplify the timing model as shown in Figure 26, where we need only two timing parameters, D_fp and D_bp. Referring to Section 5.3.2, D''_sfl in D_fp is the forward latch time of the output register of the function unit, and D'_sbl in D_bp and D'_sp in D_fp are the backward latch time and the propagation delay time of the input register of the function unit. In other words, we adopt the stage delay parameters for the simplified model. This simplified DFG timing model complies with the data flow model in Section 3.3; it reduces the simulation complexity and provides a simpler model for high-level synthesis problems.

(Footnote) The part corresponding to the shaded part in Figure 23 is not shown in this model.


According to the timed behavior of a DFG/EDFG, we can simulate and analyze a system from its behavior description. Furthermore, the attached parameters and analysis formulations can be used as measures in high-level synthesis. These are discussed in the next section.


Figure 26: Simplified timing model for DFG nodes. (States shown include S4, output produced, and S5, resetting; arc labels, as far as they can be recovered: D_bp = D_bp + D'_sbl and D_fp = D'_sp + D_fp + D''_sfl.)


6  Data  Flow  Specification  for  High-Level  Synthesis 

One of the main reasons to use a data flow specification is to have a sound functional and timing model to describe system behavior so that designers can make design decisions at the high level. In this section we first discuss the transformations which preserve the functionality of DFG descriptions and result in different realizations with different performance. Two kinds of function-preserving transformations are discussed: sharing schemes, which provide design templates to map a DFG to another DFG with fewer atomic function nodes; and local transformations, which are rules to map a substructure of a DFG to another substructure so that the area and/or the performance of the new DFG is improved. Based on the sharing schemes in this section and on the performance model described in the previous section, we address the sequencing and allocation problem, in which we want to find sequences and allocations of atomic functions (of data paths) with optimized or near-optimized system performance and/or area consumption.


6.1  Sharing  Schemes 

Resource  sharing  is  broadly  used  to  reduce  the  hardware  size  as  well  as  the  implementation 
cost  in  digital  system  designs.  There  are  two  issues  related  to  resource  sharing  in  the  design 
process: 

1. Allocation - For each operator, which operations are to be executed by the shared operator?

2. Sequencing - For each operator, what is the execution order of the operations which share the operator?

Here  we  focus  on  the  transformations  of  a  DFG  into  another  DFG  for  a  given  sequence  and 
allocation  of  operations  under  the  DFG  paradigm,  namely,  sharing  schemes. 

In our synthesis system, each node in a DFG is implemented by a hardware module, i.e., an operator. Assume that there is no multi-function operator such as an ALU. A sharing scheme then transforms an (original) DFG into a (mapped) DFG in the following way: N atomic function nodes of the same type in the original DFG are replaced by M nodes of the same type with a proper control and routing structure in the mapped DFG, where N > M; the sequence of operations in the original DFG is preserved by the sequence of operations in the mapped DFG, i.e., system functionality is preserved. There are three parts in each sharing scheme: the shared unit(s), the control part, and the data routing part, which together are referred to as the sharing structure. Figure 27 demonstrates an abstract sharing scheme for
N = 4 and M = 2. In the mapped DFG, the output of g1 (g2) is executed by m1, and this execution generates an output as the input of g5 (g6); the output of g3 (g4) is executed by m2, and this execution generates an output as the input of g7 (g8). The control part and the routing part in the sharing structure are used to ensure that the sequence of operations in the mapped DFG preserves the sequence of operations in the original DFG. Therefore, the mapped DFG uses fewer functional nodes than the original one while performing the same functions; however, the mapped DFG has extra control/routing nodes.

Classification and notation  Let the N atomic function nodes of the same type in the original DFG be labeled v_1, v_2, ..., v_N, and the M nodes of the same type in the mapped DFG be labeled m_1, m_2, ..., m_M. There are two classes of sharing schemes: sharing schemes with fixed allocation and sharing schemes with dynamic allocation. In sharing schemes with fixed allocation, the execution of each node v_i is assigned to a specific node m_j. Let GN_j be the set of v's assigned to m_j, i.e., there are |GN_j| operations sharing operator m_j. On the other hand, the execution of each node v_i in sharing schemes with dynamic allocation is not assigned to a specific node m_j. Instead, the execution of node v_i is dynamically assigned to any operator m_j, which usually is an idle one at the time when the input data of node v_i is available. Let GN_{j1,j2,...,jk} be the set of v's to be dynamically allocated to the set of k operators m_{j1}, m_{j2}, ..., m_{jk}, where |GN_{j1,j2,...,jk}| > k, i.e., there are |GN_{j1,j2,...,jk}| operations sharing the k operators. Without loss of generality, the presentation of the following schemes uses |GN_j| = 4 for sharing schemes with fixed allocation and |GN_{j1,j2}| = 4 and k = 2 for sharing schemes with dynamic allocation. The original four operations for |GN_j| = 4 or |GN_{j1,j2}| = 4 are shown in Figure 28, where I1, ..., I4 can be either connected to input port(s) of the original DFG or connected to output(s) of some nodes in the DFG, and O1, ..., O4 can be either connected to output port(s) of the DFG or connected to input(s) of some nodes in the DFG. For example, one DFG has the four operations connected serially, e.g., O1, O2, and O3 are connected to I2, I3, and I4 respectively, and I1 and O4 are connected to the input port and the output port of the DFG.

(a) Before applying sharing scheme  (b) After applying sharing scheme

Figure 27: An abstract sharing scheme with fixed allocation for N = 4, M = 2.

6.1.1  Sharing  Schemes  with  Fixed  Allocation 

The problem in a fixed allocation is to map a set of operations into a shared unit (operator). Figure 27 shows the outline of this kind of sharing scheme. Besides the shared unit, the control part generates condition tokens to control the data routing part so that these operations are executed by the shared unit in a certain order. Two ordering schemes for fixed allocation are presented. One is the scheme with variable sequence, in which the order of operations processed by a sharing structure is based on the first-come-first-served (FCFS) ordering scheme. The other is the scheme with fixed sequence, in which the order of operations processed by a sharing structure is pre-determined by system designers or scheduling algorithms during the process of high-level synthesis.

Fixed-allocation sharing scheme with variable sequence  Figure 29(a) is a fixed-allocation sharing scheme with variable sequence for |GN_j| = 4. In this scheme, an MSelector and an MDistributor with condition input form the data routing part. Four R([ ]) functions and a CFvs4 function form the control part. R([ ]) is an atomic function which passes the input token to the output without passing the input data value. CFvs4 is a macro function which generates condition tokens to indicate which input token is available so that the proper data path in the data routing part is opened. Figure 29(b) is the DFG definition of function CFvs4, where C((const)) is an atomic function which generates a token with constant data value (const) if it receives a null data token. For example, suppose I2 has data available; it activates the second R([ ]) to generate a null data token to CFvs4. Due to the token at the second input of CFvs4, 'b01 data tokens are generated on both outputs of CFvs4, and they open the data route from input 1 of the MSelector to output 1 of the MDistributor. In case more than one input has data available, an Arbiter in CFvs4 decides the order of routing paths/operations based on an FCFS ordering scheme.
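The variable-sequence control can be mimicked in a few lines of Python. This is only a behavioral sketch under stated assumptions: ports I1..I4 are indexed 0..3, the Arbiter is modeled as a plain FIFO queue over the order in which inputs become available, and the function names are hypothetical rather than taken from the design library.

```python
from collections import deque


def cfvs4_conditions(arrival_order):
    """Return one two-bit condition token per arriving input port,
    first come first served. The deque stands in for the Arbiter."""
    queue = deque(arrival_order)
    tokens = []
    while queue:
        port = queue.popleft()
        tokens.append(format(port, "02b"))
    return tokens


def shared_unit(inputs, arrival_order, op):
    """Route each input through the single shared operator `op`.
    The condition token selects the MSelector input and the matching
    MDistributor output, so result k always belongs to input k."""
    outputs = [None] * len(inputs)
    for cond in cfvs4_conditions(arrival_order):
        port = int(cond, 2)
        outputs[port] = op(inputs[port])
    return outputs
```

If I2 becomes available first, the first token is "01", which matches the route from input 1 of the MSelector to output 1 of the MDistributor described above.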

Fixed-allocation sharing scheme with fixed sequence  Figure 30(a) is a fixed-allocation sharing scheme with fixed sequence for |GN_j| = 4, and it is similar to the sharing scheme in Figure 29(a) except for the control part. CFfs4, which forms the control part of Figure 30(a), is a macro function which generates a fixed sequence of condition tokens so that data paths are opened in a pre-determined order. Figure 30(b) is the DFG definition of function CFfs4, where COUNT4 is an atomic function which generates the next condition token from the current condition token. In this example, CFfs4 is a two-bit cyclic counter with the initial value 'b00, i.e., it generates the sequence of condition tokens 'b00, 'b01, 'b10, and 'b11 repeatedly. Therefore, the order of operations is MUL(I1) followed by MUL(I2) followed by MUL(I3) followed by MUL(I4). Even if the data at I4 is available before the data at I3, the data at I4 cannot be processed until MUL(I3) is completed. The order of operations can be changed either by changing the input/output location or by changing the sequence generator.
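The effect of the fixed sequence on start times can be sketched as follows. The sketch assumes a single shared operator with a constant execution time, ignores routing and control delays, and indexes ports I1..I4 as 0..3; the function names and the timing figures used to exercise it are illustrative only.

```python
def cffs4_tokens(count, initial=0):
    """First `count` condition tokens of a two-bit cyclic counter,
    written as bit strings."""
    return [format((initial + k) % 4, "02b") for k in range(count)]


def fixed_sequence_start_times(arrival, duration):
    """Start time of each operation when one shared operator serves
    ports 0..3 strictly in counter order. `arrival[p]` is the time at
    which port p has data; `duration` is the operator execution time."""
    start = []
    free_at = 0
    for port in range(4):
        begin = max(free_at, arrival[port])  # wait for data and for the operator
        start.append(begin)
        free_at = begin + duration
    return start
```

With arrival times in which I4 is ready well before I3, the operation on I4 still starts last, which is the postponement the paragraph above describes.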

(Footnotes) CFvs stands for the control function with variable sequence. The prefixes 'b, 'o, and 'h are used to lead binary, octal, and hexadecimal numbers, respectively.

(a) Sharing scheme with fixed allocation with variable sequence  (b) Definition of CFvs4

Figure 29: A sharing scheme with fixed allocation with variable sequence.

(a) Sharing scheme with fixed allocation with fixed sequence  (b) Definition of CFfs4

Figure 30: A sharing scheme with fixed allocation with fixed sequence.

Functionality preservation  In order to show that the mapped DFG preserves functionality, we need to show that the order of operation execution in the original DFG is preserved by the mapped DFG. In an FCFS ordering scheme, no specific order is enforced on the operations which share the same operator. In fact, any order of these operations is acceptable to the sharing structure. Therefore, the mapped DFG which uses the FCFS ordering scheme follows and preserves the operation execution order of the original DFG. On the other hand, the sharing scheme with fixed sequence may have a problem preserving the operation execution order. For example, suppose I1 is connected to O2 in the original DFG, so the operation on I1 has to be executed after the operation on I2 has been executed to generate output O2. Then the sequence provided by Figure 30(b) would not work. In order to use the sharing scheme with fixed sequence, designers and synthesis algorithms need to give a sequence which preserves the original operation ordering, i.e., data dependency. There are many ways to solve this problem, e.g., interchanging I1 and I2 and interchanging O1 and O2 will fix the problem in Figure 30.
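Whether a candidate fixed sequence respects the data dependencies of the original DFG can be checked mechanically. The sketch below is one simple way to do so, not a procedure from the report: operations are numbered 1..4, and `depends_on` is a hypothetical map from each operation to the operations whose outputs it consumes.

```python
def preserves_dependencies(sequence, depends_on):
    """Return True if every operation appears in `sequence` after all
    operations it depends on.

    sequence   -- list of operation ids in their fixed execution order
    depends_on -- dict: operation id -> set of ids that must run earlier
    """
    position = {op: k for k, op in enumerate(sequence)}
    return all(
        position[pre] < position[op]
        for op, pres in depends_on.items()
        for pre in pres
    )
```

For the case in the text, where O2 feeds I1, operation 1 depends on operation 2: the counter order 1, 2, 3, 4 fails the check, while the order 2, 1, 3, 4 obtained by interchanging the two ports passes.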


(a) Before applying sharing scheme  (b) After applying sharing scheme

Figure 31: An abstract sharing scheme with dynamic allocation for N = 4, M = 2.


6.1.2  Sharing Scheme with Dynamic Allocation

The problem in this kind of sharing scheme is to map a set of operations into a set of shared units (operators) dynamically. Figure 31 shows the outline of this kind of sharing scheme. In this sharing scheme, the data routing part should have routes from every data input to every shared unit as well as routes from every shared unit to every data output. The condition tokens generated by the control part need to indicate not only the input-to-output route to be executed but also which shared unit is used. By generalizing the operation ordering schemes and creating different operator allocation schemes, there may be many different kinds of sharing schemes under this category. Here we present only one sharing scheme. In this sharing scheme, the FCFS ordering scheme is used for the operation ordering. By assuming that the operator started first is released first, sequentially cyclic ordering is used for the operator allocation, i.e., operator 1 to operator k are sequentially and cyclically allocated. Figure 32 is a sharing scheme with dynamic allocation of this kind for |GN_{j1,j2}| = 4. In this sharing scheme, the four-input MSelector and the four-output MDistributor, which control the input-output routing, and the two-output MDistributor and the two-input MSelector, which control the operator routing, form the data routing part. Four R([ ]) functions and function CFvs4, which generate tokens to control the FCFS input-output routing, and a 1-bit counter CFfs2, which generates tokens to control the operator routing, form the control part. I is an atomic function which passes its input token to its output, and it is used to store a condition token for an unfinished operation in the sharing structure, e.g., there may exist two sets of condition tokens for two operations in the structure at the same time in Figure 32. For example, suppose I2 has data available first; it activates the second R([ ]) and makes CFvs4 generate 'b01 data tokens to the condition port of the four-input MSelector and to the condition port of the four-output MDistributor. Meanwhile, CFfs2 generates 'b0 to the condition port of the two-output MDistributor and to the condition port of the two-input MSelector. Then the data at I2 is passed to the input of the left MUL in Figure 32. Now suppose I4 has data available right after I2 does. Because one I node at the output of CFvs4 keeps the condition token 'b01 for I2, CFvs4 can generate new 'b11 tokens for I4. Similarly, CFfs2 can generate 'b1 after I2 is passed through the two-output MDistributor in front of the two MULs. Therefore, the data at I4 can start being executed by the right MUL even though the data of I2 is still in the sharing structure.

Figure 32: A sharing scheme with dynamic allocation.
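The pairing of FCFS input selection with cyclic operator allocation can be sketched as below. This is a simplified model under stated assumptions: ports I1..I4 are indexed 0..3, operators are indexed from 0, timing is ignored, and the function name is hypothetical.

```python
def dynamic_allocation(arrival_order, k=2):
    """Pair each arriving input port with an operator.

    The first element of each pair plays the role of the CFvs4
    condition token (which input/output route to open); the second
    plays the role of the cyclic counter output (which of the k
    operators to use), handed out in the order 0, 1, ..., k-1, 0, ...
    """
    plan = []
    for n, port in enumerate(arrival_order):
        io_condition = format(port, "02b")
        operator = n % k
        plan.append((io_condition, operator))
    return plan
```

When I2 arrives first and I4 second, the plan starts with ("01", 0) and ("11", 1): I2 goes to the first (left) operator and I4 to the second (right) one, as in the walk-through above.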

6.1.3  Sharing  Scheme  with  Micropipelined  Shared  Unit 

A node/function can be micropipelined [28] by being partitioned into a pipeline, i.e., the node becomes a macro function defined by a series of sub-functions. For example, MUL may be partitioned into three sub-functions, MUL1, MUL2, and MUL3, such that MUL(t) equals MUL3(MUL2(MUL1(t))) for each input t, which includes two operands. With micropipelined shared units, there are still two classes of allocation methods as stated previously. However, we may use a fixed-allocation sharing scheme with a micropipelined function unit, whose behavior resembles that of a sharing scheme with dynamic allocation and multiple shared units. Figure 33 shows a sharing scheme with a three-stage micropipelined shared unit; it is a fixed-allocation sharing scheme with the FCFS ordering scheme for N = 4 and M = 1. The behavior of Figure 33 is similar to the behavior of Figure 32 with N = 4 and M = 3. Without considering the control overhead, the timing and resource usage diagram in Figure 34 shows the behavioral resemblance between these two schemes, where m1,2,3 represents the three stages of the micropipelined shared unit, the execution time for each stage is 40 nsec, m1, m2, and m3 represent three shared units, and the four operations are executed in the order I1, I2, I3, and I4.

Figure 33: A sharing scheme with micropipelined shared units.

Figure 34: Timing behavior of micropipelined sharing scheme.

6.1.4  Effects  of  Sharing  Schemes 

Since a sharing scheme is a fixed template with respect to the number of sharing operations, the number of shared operators, and the kind of scheme, we can easily estimate the effects of the sharing scheme, such as area and performance.

Area  Although a sharing scheme maps one DFG into another DFG with fewer atomic function nodes, some extra nodes are required to form the routing part and the control part in the mapped DFG. Therefore, the area gain of a mapping equals the area of the eliminated atomic functions; the area overhead of the mapping equals the area of these extra nodes. Because resource sharing always reduces system performance, if the area overhead is greater than the area gain, then this mapping should be abandoned.

Performance  After sharing schemes are applied, the execution time of each operation which shares an operator with other operations is increased by the routing delay and possible control delay. Besides the overhead caused by the sharing structure, the starting execution time of some operations is postponed due to the sequence enforced by the sharing structure.

The performance and area effects of DFG mapping can be easily obtained by simulating the mapped DFG and by counting the area of the mapped DFG. Furthermore, we need to quantify these effects in terms of the kind of sharing scheme and the number of sharing operations and shared operators, so that these quantities can be used in high-level design decisions such as the sequencing and allocation problem.

Example [effects of sharing schemes with fixed allocation having fixed sequence]  Let us analyze the effects of the sharing scheme shown in Figure 30. Assume that there are N operations which share one operator, where N > 1. Let the area cost of the operator for the operation be A_op; let the area cost of the n-input MSelector (n-output MDistributor) be n * A_s (n * A_d) with the constant factor A_s (A_d); and let the area cost of the n-bit counter be n * A_c with the constant factor A_c. The area gain/overhead analysis for the sharing scheme is as follows:

Area gain: (N - 1) operators;

Area overhead: an N-input MSelector, an N-output MDistributor, and a ⌈log2 N⌉-bit counter;

Total area gain: (N - 1) * A_op - N * (A_s + A_d) - ⌈log2 N⌉ * A_c
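The total area gain is easy to evaluate for candidate values of N. The sketch below assumes the formula as reconstructed here, (N - 1) * A_op - N * (A_s + A_d) - ⌈log2 N⌉ * A_c; the function name and the area figures used to exercise it are made up for illustration.

```python
def total_area_gain(n_ops, a_op, a_s, a_d, a_c):
    """Net area saved when `n_ops` operations share one operator.

    a_op -- area of one operator
    a_s  -- per-input area factor of the MSelector
    a_d  -- per-output area factor of the MDistributor
    a_c  -- per-bit area factor of the counter
    """
    # For n_ops >= 1, (n_ops - 1).bit_length() equals ceil(log2(n_ops))
    # and avoids floating-point rounding.
    counter_bits = (n_ops - 1).bit_length()
    gain = (n_ops - 1) * a_op
    overhead = n_ops * (a_s + a_d) + counter_bits * a_c
    return gain - overhead
```

A negative result means the routing and control overhead outweighs the eliminated operators, in which case the mapping should be abandoned according to the area discussion above.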

Based on the analysis of our design library, the forward propagation delay time of the n-input MSelector, FP_MSelector(n), can be formulated as k1 + ⌈log2 n⌉ * k2, where k1 and k2 are constant factors and k1 ≫ k2 [33]. Generally we can ignore k2 and assume that FP_MSelector(n) = FP_MSelector is a constant. The backward propagation delay time of the n-input MSelector, BP_MSelector(n), the forward propagation delay time of the n-output MDistributor, FP_MDistributor(n), and the backward propagation delay time of the n-output MDistributor, BP_MDistributor(n), can be similarly formulated, so they are assumed to be the constants BP_MSelector, FP_MDistributor, and BP_MDistributor. The performance overhead analysis for the sharing scheme is as follows:

Overhead of forward propagation delay time: FP_ov = FP_MSelector + FP_MDistributor

Overhead of backward propagation delay time: BP_ov = BP_MSelector + BP_MDistributor
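The delay model can be written down directly. The sketch below assumes the form k1 + ⌈log2 n⌉ * k2 for the MSelector forward delay and simply adds the selector and distributor terms for the overheads; the function names and the values of k1 and k2 used to exercise it are placeholders, not figures from the design library.

```python
def fp_mselector(n, k1, k2):
    """Forward propagation delay of an n-input MSelector under the
    assumed model k1 + ceil(log2(n)) * k2 (n >= 1)."""
    return k1 + (n - 1).bit_length() * k2


def sharing_overhead(fp_sel, fp_dist, bp_sel, bp_dist):
    """Return (FP_ov, BP_ov): the forward and backward delay added to
    an operation by the routing part of the sharing structure."""
    return fp_sel + fp_dist, bp_sel + bp_dist
```

When k1 is much larger than k2, growing n from 4 to 16 changes the selector delay only slightly, which is why the text treats it as a constant.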

Furthermore, FP_MSelector and FP_MDistributor can be assumed to be zero in the sharing structure because the data computation and the control generation proceed in parallel; this is a hardware implementation issue and is not discussed in this report.

6.2  Local Transformations

Algorithmic transformations can be used to improve the design efficiency at the behavioral level so that the resulting design description can generate a suitable implementation [27, 29, 31]. Most transformations use the peephole optimization technique, similar to that used in compiler design, and are therefore called local transformations in this context. The biases in behavioral-level descriptions are caused by the designers' coding style or generated by other transformations such as sharing schemes. Transformations are developed to reduce the number of operations, to reduce the size of control structures, to reduce the length of the critical path, to remove redundancy, and so on. Snow has systematically developed transformations for the CMU RT-CAD system [27]. Most of Snow's transformations are general enough to have analogous transformations in our system, such as dead activity elimination, redundant activity elimination, select factoring/combination, etc., so we will not describe the available transformations in our system. Instead, we present a few transformations to show the role of tokens and data types in DFG/EDFG transformations. Because of the token model, the correctness of these transformations can be easily proven by symbolic (token) simulation. Later we present one transformation which is often used to reduce the routing part and the control part of sharing structures, and it will be extensively used in the examples of the next section.

Symbolic token simulation  Figure 35 is a simple local transformation. We can show that these two DFGs are equivalent by simulation. Given a token with data value d_i and data type t_i for input I_i, for i = 1, 2, to both DFGs, a token is produced with data value (d_1, d_2) and data type (t_1, t_2) at each of the outputs O1 and O2 for both DFGs. Therefore, they are functionally and behaviorally equivalent.


Figure 36: Local Transformation 2.

Data type matching  In the transformations of a DFG, we need to consider not only the equivalence of token generation but also the equivalence of the token value and the token type. Figures 36(a) and (b) are two DFGs which look equivalent. We can show by simulation that these two DFGs are not equivalent. Given a token with data value d_i and data type t_i for input I_i, for i = 1, 2, 3, to both DFGs, a token is produced with data value (d_1, (d_2, d_3)) and data type (t_1, (t_2, t_3)) at the output of Figure 36(a), and a token is produced with data value (d_1, d_2, d_3) and data type (t_1, t_2, t_3) at the output of Figure 36(b). They are the same in terms of hardware implementation, but they are not the same in terms of the DFG specification. With atomic functions for data type conversion, Figures 36(b) and (c) are equivalent, and Figures 36(a) and (d) are equivalent. R([1, 1∘2, 2∘2]) and R([1, [2, 3]]) are atomic functions called routers in our system. Router functions rearrange input data by copying, repeating, and shuffling input data. The notation of router functions is based on Backus' FP [2], where i is the FP selector function, the square bracket [...] is the functional form of construction, and the circle ∘ is the functional form of composition. These are defined as follows:

\[ i : (x_1, x_2, \ldots, x_n) = x_i, \quad \text{for any positive } i \le n; \]

\[ [f_1, f_2, \ldots, f_n] : x = (f_1 : x,\ f_2 : x,\ \ldots,\ f_n : x); \]

\[ f \circ g : x = f : (g : x), \]

where x and (x_1, x_2, ..., x_n) are input objects of functions, and f_1, f_2, ..., f_n, f, and g are functions, e.g., FP selector functions. Therefore, we can show that Figures 36(b) and (c) are equivalent as follows.


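$$
\begin{aligned}
[1,\, 1 \circ 2,\, 2 \circ 2] : (d_1, (d_2, d_3))
  &= (1 : (d_1, (d_2, d_3)),\; 1 \circ 2 : (d_1, (d_2, d_3)),\; 2 \circ 2 : (d_1, (d_2, d_3))) \\
  &= (d_1,\; 1 : (d_2, d_3),\; 2 : (d_2, d_3)) \\
  &= (d_1, d_2, d_3)
\end{aligned}
$$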
Figure 37: Local Transformation 3. (a) Before transformation; (b) after transformation.


Similarly, we can show that Figures 36(a) and (d) are equivalent, i.e.,

$$ [1, [2, 3]] : (d_1, d_2, d_3) = (d_1, (d_2, d_3)). $$
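The three FP forms used by the routers can be mimicked in a few lines of Python. This is only an illustrative sketch: the helper names `sel`, `construct`, and `compose` are hypothetical (they are not part of the synthesis system), and tuples stand in for structured tokens.

```python
# Minimal sketch of the FP selector, construction, and composition forms.
# All names here are illustrative assumptions, not the report's tool API.

def sel(i):
    """FP selector i: pick the i-th element (1-based) of a tuple."""
    return lambda x: x[i - 1]

def construct(*fs):
    """Construction [f1, ..., fn]: apply every fk to the same input x."""
    return lambda x: tuple(f(x) for f in fs)

def compose(f, g):
    """Composition f o g: apply g first, then f."""
    return lambda x: f(g(x))

# Router [1, 1o2, 2o2]: flattens (d1, (d2, d3)) into (d1, d2, d3).
flatten_router = construct(sel(1), compose(sel(1), sel(2)), compose(sel(2), sel(2)))

# Router [1, [2, 3]]: nests (d1, d2, d3) into (d1, (d2, d3)).
nest_router = construct(sel(1), construct(sel(2), sel(3)))

flat = flatten_router(("d1", ("d2", "d3")))
nested = nest_router(("d1", "d2", "d3"))
```

Applying the two routers one after the other returns the original token structure, which is the sense in which they convert between the two data types.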

A transformation for sharing scheme reduction  Figure 37(a) is a DFG under a special
condition, which often appears after sharing schemes are applied. In this figure, two output
ports of an MDistributor are connected to two input ports of an MSelector, and the order
of the tokens that pass from the input of the MDistributor to the output of the MSelector is
preserved; e.g., in Figure 37(a), if j obtains a token after i obtains a token from X, then Y
obtains a token from q after Y obtains a token from p. In this situation, we can combine
these two paths into one, removing one output port from the MDistributor and one input
port from the MSelector. In addition to removing these ports, the control token generation
for the MDistributor and the MSelector needs to be changed for the transformed DFG.
Figure 37(a) is transformed to Figure 37(b) with one possible mapping for the outputs of
the MDistributor and the inputs of the MSelector as shown below.

Output port index mapping for the MDistributor with ij = i:

    Before transformation    After transformation
    x                        x,      if 0 ≤ x < i;
                             i,      if x = i or x = j;
                             x − 1,  if i < x < n and x ≠ j.

Input port index mapping for the MSelector with pq = p:

    Before transformation    After transformation
    y                        y,      if 0 ≤ y < p;
                             p,      if y = p or y = q;
                             y − 1,  if p < y < m and y ≠ q.
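The effect of this mapping on a control sequence can be sketched as follows. The function names are hypothetical, and the sketch deliberately handles only the case of adjacent merged ports (j = i + 1), which is an assumption made here because it is the case that arises in the control sequences quoted in Section 7.

```python
# Sketch of the port index mapping applied to a control sequence.
# Assumption: the two merged ports are adjacent (j == i + 1).

def merge_port_index(x, i, j):
    """Map an old port index x to its index after ports i and j are merged."""
    if j != i + 1:
        raise ValueError("this sketch only covers adjacent ports (j == i + 1)")
    if x < i:
        return x          # ports below the merged pair keep their index
    if x in (i, j):
        return i          # both merged ports become port i
    return x - 1          # ports above the merged pair shift down by one

def merge_sequence(seq, i, j):
    """Rewrite a whole control sequence for the transformed DFG."""
    return [merge_port_index(x, i, j) for x in seq]

# Merging ports 1 and 2 of a 4-port MDistributor.
seq_a = merge_sequence([0, 1, 2, 3], 1, 2)

# Repeatedly merging ports 1 and 2 of an 8-port structure collapses
# the sequence (0,...,7) step by step.
seq_b = list(range(8))
for _ in range(6):
    seq_b = merge_sequence(seq_b, 1, 2)
```

Under these assumptions the results agree with the MDistributor sequences (0,1,1,2) and the MSelector sequence (0,1,1,1,1,1,1,1) reported for the example designs.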
6.3  Sequencing and Allocation

Among the many different sharing schemes previously mentioned, the sharing scheme with
fixed allocation and fixed sequence is often used in digital system design. However, a
designer needs to determine not only the allocation of operations but also the execution order
of operations, and this determination will have a significant impact on the performance and
area of the final implementation. This problem is analogous to the scheduling and allocation
problem in synchronous system synthesis, but with no clock-controlled time step, i.e., the
scheduling problem in an asynchronous system cannot be viewed as a partitioning of
operations into time steps as in synchronous systems [11, 18]. This problem is closely related
to the resource-constrained project scheduling problem [3], and the temporal aspects of this
scheduling problem can be equivalently represented by partial orders [22]. Therefore, this
problem is called the sequencing and allocation problem in asynchronous system synthesis.

Based on the goal of the synthesis task, there are many kinds of synthesis problems, e.g.,
the cost-constrained synthesis problem and the performance-constrained synthesis problem
[24]. In this report, we only formulate and provide algorithms for the resource-constrained
sequencing and allocation problem for non-pipelined systems.

6.3.1  Problem Statement

The problem that we are addressing is how to sequence and allocate the operations of an
asynchronous system for a given set of resources (operators) so that the system will perform
efficiently. The behavior of an asynchronous system is described by a DFG. There is a set
of n operations in a system, each of which corresponds to an atomic function in the DFG
and belongs to a specific type. Data precedence among operations is implied by the DFG
description, where each directed arc represents the direction of data to be transferred between
two operations. There are k types of operations in the system. For each type of operation,
there is at least one resource operator. Each operator is associated with a computation delay
time and a backward control delay time, which correspond to the DFG timing parameters
Dfp and Dbp, respectively. The operators of the same type don't have to be identical, i.e.,
they don't have to have the same values of Dfp and Dbp. For a non-pipelined system,
the system performance is determined by the completion time. (For a pipelined system,
the system performance is determined by the throughput rate or the pipeline period.) Our
synthesis problem for a system involves the following tasks:

•  What  operator  is  each  operation  allocated  to? 

•  What  is  the  execution  order  of  these  operations  which  are  allocated  to  the  same  oper¬ 
ator? 

The objective of the problem is to minimize the system completion time. Currently our
synthesis algorithms assume that the DFG is acyclic, so the user needs to unroll the loops
or choose the loop body before synthesizing the system.
Figure 38: A simple example of the sequence and allocation problem. (a) A five-addition DFG; (b) the Gantt chart of a sequence and allocation.


6.3.2  Timing  Model  for  Sequencing  and  Allocation  Problem 

The timing model used in sequencing and allocation is the same model that is described
in Section 5.4.2. Here we want to show how this timing model is used in the sequencing and
allocation problem and how the sharing effects discussed in Section 6.1.4 are used in this
timing model for synthesis.

Computation time and resource occupied time  In an asynchronous system, the time
interval during which an operation is computed by an operator and the time interval during
which the operator is occupied by the operation are different. In Figure 21, FPi corresponds
to the computation time of an operation at stage i, and (FPi + BPi) corresponds to the time
that stage i is occupied by the operation for a non-pipelined system operation. The time
that stage i is occupied by an operation is greater than (FPi + BPi) for a pipelined system
operation. E.g., (FP0 + BP0 + S) is the time interval during which an operation occupies
stage 0 for the pipelined operation in Figure 21, where S is the time that the operation
is waiting for the input register of the next stage to be available. Since we are dealing
with non-pipelined system operation, we don't need to worry about S at this time. Figure
38(a) is an example DFG with five additions. There are three identical adders available,
ADDER1, ADDER2, and ADDER3, and they have Dfp = 3 and Dbp = 3. For convenience,
we use a two-input addition function to reduce the analysis complexity and we do not
consider the sharing overhead in this example. Figure 38(b) is a Gantt chart representing
a sequence and allocation for these five additions, where dashed arrowed lines represent the
data precedence of the DFG. In this example, add4 can start when both add1 and add2
finish their computation because add4 does not share the same operator with either. On the
other hand, add3 can start only when add1 releases ADDER1 because they are both
allocated to ADDER1 and add3 is sequenced after add1.
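The distinction between completion time and release time can be made concrete with a small sketch. It uses Dfp = 3 and Dbp = 3 from the example, but several details are assumptions made only for illustration, since Figure 38 is not reproduced here: add2 is placed on ADDER2, add4 on ADDER3, add3 is taken to depend only on primary inputs, and add5 is omitted.

```python
# Sketch: start time = max(parents' completion times, operator release time).
# The allocation and the dependencies of add3 are illustrative assumptions.

D_FP, D_BP = 3, 3  # computation delay and backward control delay of each adder

def schedule(op_order, parents, alloc):
    """Return {op: (start, completion, release)} for a fixed sequence/allocation."""
    times = {}
    free_at = {}  # time at which each operator is released by its last operation
    for op in op_order:
        data_ready = max((times[p][1] for p in parents[op]), default=0)
        start = max(data_ready, free_at.get(alloc[op], 0))
        completion = start + D_FP
        release = completion + D_BP
        times[op] = (start, completion, release)
        free_at[alloc[op]] = release
    return times

parents = {"add1": [], "add2": [], "add3": [], "add4": ["add1", "add2"]}
alloc = {"add1": "ADDER1", "add2": "ADDER2", "add3": "ADDER1", "add4": "ADDER3"}
times = schedule(["add1", "add2", "add3", "add4"], parents, alloc)
```

In this sketch add4 starts at time 3, as soon as its parents complete, whereas add3 has to wait until time 6, when add1 releases ADDER1.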
Sharing effects  The sharing scheme used in the sequencing and allocation in this report
is the sharing scheme with fixed allocation and fixed sequence. The sharing effects of
this scheme have been analyzed in Section 6.1.4. Since we are dealing with the
resource-constrained synthesis task, the area cost is determined when the resource constraints are set.
Therefore, the area cost is not considered in this synthesis algorithm. On the other hand,
FPov and BPov play important roles in the measure of the system performance, and they
will inflate the computation delay time Dfp and the backward propagation
delay time Dbp of an operation. The system completion time is determined mainly by the
computation delay times of the operations in the system. Fortunately, FPov can be assumed
to be zero, so we can separate the sharing penalty from the system completion time, though
BPov may still have certain effects on the system performance.

6.3.3  Algorithms 

In this report, we present only two algorithms to solve the resource-constrained sequencing
and allocation problem: a heuristic algorithm and a branch and bound algorithm. The details
of the theoretical basis of these methods can be found in [3].

Heuristic algorithm  This algorithm uses the longest path delay from the output of the
node to the output of the DFG to prioritize all operations in the DFG, then schedules
these operations one by one to available resources. The delay parameter used for the path
delay and the operation delay is the computation delay of each node, while the backward
control delay is only used to determine the resource occupation time by an operation. Let
S be the prioritized list. Let Tstart(v), Tcmp(v), and Trelease(v) represent the starting time,
the computation completion time, and the resource-released time of each operation v. The
algorithm is described below.

1.  Determine the critical path delay of the DFG, the longest path delay from the input
of the DFG to the input of each node, and the longest path delay from the output of
each node to the output of the DFG.

2.  Find the operations which only receive data from input ports of the DFG; according
to the longest path delay from the output of each node to the output of the DFG, sort
these nodes into list S in non-increasing order.

3.  If S is empty, then exit.
Let v be the first operation in S.

4.  Assign operation v to a resource and schedule the operation to intervals [Tstart(v), Tcmp(v)]
and [Tcmp(v), Trelease(v)] such that

•  Tstart(v) ≥ Tcmp(w) for every w which is a parent node of v.

•  Tcmp(v) − Tstart(v) equals the computation delay time of the operator (resource).

•  Trelease(v) − Tcmp(v) equals the backward control time of the operator.

•  The resource to be assigned is available during the time [Tstart(v), Trelease(v)].

5.  v is scheduled, and it is removed from S.

6.  Find any child of v whose parents are all scheduled, and put it into S with proper
ordering.

7.  Go to step 3.
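The steps above can be sketched in Python as follows. This is a simplified illustration rather than the report's implementation, and it rests on assumptions the text does not fix: among the operators of the right type it picks the one giving the earliest completion, operations are only appended after the last one on an operator (no insertion into idle gaps), and priorities use the fastest operator of each type as the nominal delay. The data layout and all names are hypothetical.

```python
# Sketch of the priority-list heuristic (steps 1-7), under the stated assumptions.

def list_schedule(ops, operators):
    """ops: {name: (kind, [parents])}; operators: {name: (kind, d_fp, d_bp)}.
    Returns {op: (operator, t_start, t_cmp, t_release)}."""
    children = {v: [] for v in ops}
    for v, (_, parents) in ops.items():
        for p in parents:
            children[p].append(v)

    nominal = {}  # nominal computation delay per operation type
    for kind, d_fp, _ in operators.values():
        nominal[kind] = min(d_fp, nominal.get(kind, d_fp))

    memo = {}
    def tail(v):
        """Longest path delay from the output of v to the output of the DFG."""
        if v not in memo:
            memo[v] = max((nominal[ops[c][0]] + tail(c) for c in children[v]),
                          default=0.0)
        return memo[v]

    free_at = {r: 0.0 for r in operators}
    sched = {}
    ready = [v for v, (_, parents) in ops.items() if not parents]   # step 2
    while ready:                                                    # step 3
        ready.sort(key=tail, reverse=True)  # non-increasing priority, stable
        v = ready.pop(0)
        kind, parents = ops[v]
        data_ready = max((sched[p][2] for p in parents), default=0.0)
        best = None
        for r, (r_kind, d_fp, d_bp) in operators.items():           # step 4
            if r_kind != kind:
                continue
            start = max(data_ready, free_at[r])
            cand = (start + d_fp, start, r, d_bp)
            if best is None or cand[:2] < best[:2]:
                best = cand
        t_cmp, t_start, r, d_bp = best
        t_release = t_cmp + d_bp
        free_at[r] = t_release
        sched[v] = (r, t_start, t_cmp, t_release)                   # step 5
        for c in children[v]:                                       # step 6
            if c not in sched and c not in ready and all(p in sched for p in ops[c][1]):
                ready.append(c)
    return sched

# Hypothetical five-operation DFG: (a1 -> m1), (a2 -> m2), (m1, m2 -> a3),
# with one adder and one multiplier whose backward delay already includes
# an assumed sharing overhead.
ops = {"a1": ("ADD", []), "a2": ("ADD", []),
       "m1": ("MUL", ["a1"]), "m2": ("MUL", ["a2"]),
       "a3": ("ADD", ["m1", "m2"])}
operators = {"ADD1": ("ADD", 16.0, 18.7), "MUL1": ("MUL", 36.5, 18.7)}
result = list_schedule(ops, operators)
```

Sorting the ready list on every iteration is wasteful but keeps the sketch short; a priority queue would be the natural refinement.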

This  algorithm  is  a  polynomial  algorithm  with  O(n^),  where  n  is  the  number  of  nodes  in  the 

DFG. 

Branch  and  bound  algorithm  A  schedule  is  an  active  schedule  if  we  cannot  find  another 
schedule  by  simply  shifting  a  scheduled  node  to  an  earlier  starting  time.  It  has  been  shown 
that  an  optimum  schedule  is  an  active  schedule  [3].  This  algorithm  exhaustively  enumerates 
all  active  schedules  to  find  the  optimum  schedule,  and  the  branch  and  bound  technique  is 
used  to  reduce  the  search  space.  We  will  not  discuss  the  detail  of  this  algorithm,  which  can 
be  found  in  related  literature  [3]. 

7  Examples 

Two  examples  are  presented  in  this  section.  The  first  design  is  a  16-bit  unsigned  add-and- 
shift  multiplier,  and  the  second  design  is  a  16-point,  16-bit  FIR  digital  filter.  We  will  present 
these  examples  following  the  design  procedure  shown  in  Figure  1. 

7.1  Multiplier 

An asynchronous 16-bit unsigned add-and-shift multiplier takes an input, which is a (16b,
16b) multiplier-multiplicand pair, and produces an output, which is a 32b multiplication
result.

DFG description  Figure 39 is the input DFG description of the multiplier, where ASH,
f1, and f2 are atomic functions. ASH (add-and-shift) takes the multiplicand (M), a
partial multiplication result (A, Q'), and a partial multiplier (Q'') to generate a new partial
multiplication result and a new partial multiplier by means of the following operations [12]:

$$ \{ov, A\} \leftarrow A + M * Q_{LSB}, $$

$$ \{A, Q\} \leftarrow \{ov, A, Q_{MSB..LSB+1}\}, $$

where A, Q, and M are 16-bit, Q is formed by {Q', Q''}, ov is a 1-bit overflow for addition,
and LSB+1 is the second least significant bit. f1 is an atomic function that inserts zero as
the initial partial multiplication result into the input multiplier-multiplicand pair, i.e., f1:
(multiplier, multiplicand) → (16'h0000, multiplier, multiplicand); f2 is the router
R([1∘1, 1∘2, ..., 1∘16, 2∘1, 2∘2, ..., 2∘16]) that extracts the output of the multiplier from the output
of the last ASH, i.e., f2: (MS16b_product, LS16b_product, multiplicand) → 32b_product.
Figure 39: Input DFG description for the 16-bit add-and-shift multiplier.
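A behavioral sketch of the chain f1, sixteen ASH steps, f2 may help to check the two register-transfer operations. It models only the arithmetic, not the asynchronous handshaking, and the function names `ash` and `multiply` are illustrative.

```python
# Behavioral sketch of the 16-bit unsigned add-and-shift multiplier.
# Integers stand in for the 16-bit registers A, Q, and M.

MASK16 = 0xFFFF

def ash(m, a, q):
    """One add-and-shift step: {ov, A} <- A + M * Q_LSB, then shift {ov, A, Q} right."""
    total = a + m * (q & 1)
    ov, a = total >> 16, total & MASK16          # 1-bit overflow and 16-bit sum
    new_a = (ov << 15) | (a >> 1)                # ov enters A at the MSB
    new_q = ((a & 1) << 15) | (q >> 1)           # LSB of A enters Q at the MSB
    return new_a, new_q

def multiply(multiplier, multiplicand):
    a, q = 0, multiplier                         # f1: insert the zero partial result
    for _ in range(16):                          # sixteen linearly connected ASH steps
        a, q = ash(multiplicand, a, q)
    return (a << 16) | q                         # f2: concatenate MS and LS halves

product = multiply(12345, 54321)
```

After sixteen steps, A holds the most significant half and Q the least significant half of the product, which is what the router f2 extracts.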


Sequencing and allocation  This example has a special structure with all atomic
functions linearly connected. Initially we are more interested in optimizing the system throughput
rate for this design, i.e., minimizing the pipeline period. Since we don't have an algorithm to
solve this problem, we manually synthesized the design with one, two, four, and eight ASH
units. Figure 40 shows the Gantt chart of the pipelined system synthesis result of the design
with two ASH units, where the sixteen ASH operations are labeled add1 to add16 from
system input to system output, and the two ASH units are labeled ASHER1 and ASHER2. We
also synthesized this design by minimizing the system completion time. Figure 41 shows the
Gantt chart of the non-pipelined system synthesis result of the design with two ASH units.
Comparing Figures 40 and 41, we can easily observe different results for different objectives.
We also find that two ASH units are enough to reach the minimum completion time for this design,
i.e., more ASH units will not provide any better result. In the following steps, we only show
the intermediate formats of our design procedure for the sequencing and allocation results
of Figures 40 and 41.


Sharing schemes and local transformations  The next step is to apply the sequencing
and allocation result to the input DFG using the sharing scheme with fixed allocation and
fixed sequence. Figures 42 and 43 are the mapped DFGs corresponding to the synthesis
results in Figures 40 and 41, where the number of paths among ASHERs in Figure 42 and
the number of paths among ASHERs in Figure 43 are the same as the number in Figure
39, i.e., there is a separate path between each pair of add i and add (i + 1) for i = 1, ..., 15.
After applying the general sharing scheme, we find that many paths can be merged. By
applying the local transformation of Figure 37, Figure 42 is reduced to Figure 44, and Figure
43 is reduced to Figure 45. The control part of the sharing structure depends on how the
inputs of the MSelector and the outputs of the MDistributors are routed. For example, in
Figures 42 and 43, the control part CFfs8 generates the sequence (0,1,2,3,4,5,6,7) repeatedly
for both the MSelector and the MDistributor, and the control part CFfs8' generates the
sequence (0,1,2,3,4,5,6,7) repeatedly for the MSelector and the sequence (1,2,3,4,5,6,7,0)
repeatedly for the MDistributor; in Figures 44 and 45, the control part CFfs8'' generates
the sequence (0,1,1,1,1,1,1,1) repeatedly for the MSelector and the sequence (1,1,1,1,1,1,1,0)
repeatedly for the MDistributor.

Figure 42: Mapped DFG description with two ASHERs for the pipelined system.

Figure 44: Reduced DFG description with two ASHERs for the pipelined system.

Figure 45: Reduced DFG description with two ASHERs for the non-pipelined system.

Figure 46: EDFG description with two ASHERs for the pipelined system.

Figure 47: EDFG description with two ASHERs for the non-pipelined system.

Register minimization in EDFG  Before the RTL netlist of the design is generated,
we need to transform the DFG description into an EDFG description. We can also remove
unnecessary registers from the mapped EDFG description. Currently, we do not have an
algorithm for the register minimization problem, so we did this manually for this
design. Figures 46 and 47 are the EDFG descriptions corresponding to Figures 44 and 45
with some registers removed. In Figures 46 and 47, the loop of a COUNT8, two Is, and
a 3-input phantom MFork forms a 3-bit counter, and Dec.2tl and Dec.2t2 convert the 3-bit
sequence (0,1,2,3,4,5,6,7) into the 1-bit sequence (0,1,1,1,1,1,1,1) and the 1-bit sequence
(1,1,1,1,1,1,1,0), respectively.
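The behavior of the counter and the two decoders can be summarized by a short sketch. The function names below are hypothetical stand-ins for the two decoders; only the input and output sequences are taken from the text.

```python
# Sketch: decoding the 3-bit count into the two 1-bit control sequences.

def decode_for_mselector(count):
    """1-bit control for the MSelector: 0 only on the first count."""
    return 0 if count == 0 else 1

def decode_for_mdistributor(count):
    """1-bit control for the MDistributor: 0 only on the last count."""
    return 0 if count == 7 else 1

counter = range(8)  # one period of the 3-bit counter
selector_seq = [decode_for_mselector(c) for c in counter]
distributor_seq = [decode_for_mdistributor(c) for c in counter]
```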

The  last  step  is  to  map  the  EDFG  description  into  an  RTL  netlist  for  layout  generation. 
The  implementation  results  will  be  shown  in  Section  7.3. 


7.2  FIR  digital  filter 

The second design is a 16-point, 16-bit FIR digital filter. The convolution sum of the filter
is

$$ y(n) = \sum_{k=0}^{15} h(k) * x(n - k) $$

A causal FIR system with linear phase has the property that

$$ h(k) = h(15 - k), \quad \text{for } k = 0, \ldots, 7 $$

Therefore, the convolution sum can be reduced to the following form,

$$
\begin{aligned}
y(n) &= \sum_{k=0}^{7} h(k) * (x(n - k) + x(n - 15 + k)) \\
     &= h(0) * (x(n) + x(n - 15)) + {} \\
     &\quad\; h(1) * (x(n - 1) + x(n - 14)) + {} \\
     &\quad\; \cdots \\
     &\quad\; h(7) * (x(n - 7) + x(n - 8))
\end{aligned}
$$
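The reduction from sixteen to eight multiplications can be checked numerically. The sketch below compares the direct and the folded sums on arbitrary integer data; it treats samples before n = 0 as zero and ignores the fixed-point truncation used in the actual design, and all names are illustrative.

```python
# Sketch: the folded form equals the direct convolution when h(k) = h(15 - k).

def sample(x, m):
    """x(m), taking samples outside the recorded range as zero."""
    return x[m] if 0 <= m < len(x) else 0

def fir_direct(h, x, n):
    """Direct form: 16 multiplications per output."""
    return sum(h[k] * sample(x, n - k) for k in range(16))

def fir_folded(h_half, x, n):
    """Folded form: 8 pre-additions followed by 8 multiplications."""
    return sum(h_half[k] * (sample(x, n - k) + sample(x, n - 15 + k))
               for k in range(8))

h_half = [1, 2, 3, 4, 5, 6, 7, 8]          # arbitrary h(0)..h(7)
h = h_half + h_half[::-1]                   # symmetric extension h(k) = h(15 - k)
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 5, -8, 9, 7, 9, 3, 2, 3, 8, 4]
direct = [fir_direct(h, x, n) for n in range(len(x))]
folded = [fir_folded(h_half, x, n) for n in range(len(x))]
```

The folded form needs 8 pre-additions, 8 multiplications, and 7 accumulating additions per output, which is why the DFG contains fifteen additions and eight multiplications.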


DFG description  Figure 48 is the input DFG description of the FIR filter. There are
two inputs for this system: in_H is the 8b input for the initialization of the system coefficients
h(i) for 0 ≤ i ≤ 7; in_X is the 16b input for x(n) for n ≥ 0. There is one 16b output,
out_Y, for y(n) for n ≥ 0 in the system. In Figure 48, ADD, MUL, and f3 are atomic
functions. We use a fixed-point computation to implement this design, where the 16-bit
adder, ADD, and the 8-bit multiplier, MUL, are used to manipulate 16-bit data. f3 is
the router R([1,2,3,4,5,6,7,8]), which truncates the last 8 bits from the 16-bit output of
the ADD, so the following MUL can have a proper 8-bit input. In Figure 48, CFfs8_1 and
mem_H are macro functions: CFfs8_1 generates the sequence (0,1,2,3,4,5,6,7) once after the
system is started, so eight data tokens read in from in_H are distributed to the proper mem_Hs
for h(0) to h(7); mem_H, whose DFG description is shown in Figure 49, reads in a data
token from its input and keeps producing the same data token forever. There are fifteen
data tokens in the DFG description of the FIR filter. These tokens represent the initial data
of x(n − k) for 1 ≤ k ≤ 15, and they have the data value zero in this specification, i.e.,
x(−1) = x(−2) = ... = x(−15) = 0 with n = 0 initially. After h(i) for 0 ≤ i ≤ 7 are read,
each input x(n) will generate an output y(n).

Figure 49: DFG description for mem_H. (a) DFG description of mem_H; (b) DFG description of SeqOlr.


    Construct Name    Dfp (nsec)    Dbp (nsec)
    ADD               16.0          4.4
    MUL               36.5          4.4
    f3                0.0           0.0
    MFork_2           0.0*          3.1
    MJoin_2           0.0           0.0
    overhead          0.0*          14.3

Table 1: Timing parameters for the FIR filter synthesis.


    Resource constraints (No. of operators)    Completion time (nsec)
    No. of MUL    No. of ADD                   Heuristic    Optimum
    ≥ 1           1                            501.8        501.8
    1             ≥ 2                          454.9        454.9
    ≥ 2           2                            258.9        258.9
    2             ≥ 3                          250.1        250.1
    3             ≥ 3                          194.9        194.9
    ≥ 4           3                            186.8        186.8
    4             ≥ 4                          171.7        171.7
    ≥ 5           4                            167.2        167.2
    ≥ 5           ≥ 5                          164.5        164.5

Table 2: The sequencing and allocation results of the 16-point FIR filter.

Sequencing and allocation  There are three types of atomic functions in this design, but
we only need to consider two of them, ADD and MUL, since the third, f3, can be implemented
by physically truncating unused data. Currently we use one kind of implementation for each
type of operation. Timing parameters for all implementations are shown in Table 1. Dfp
of MFork_2 and Dfp labeled "overhead", which are both asterisked in Table 1, are zero because
data computation and control generation proceed in parallel in the hardware implementation.
The overhead of the backward propagation delay time for the sharing scheme is 14.3 nsec.
For convenience, we remove the token input part of the DFG, and Figure 50 shows the DFG
without this token input part. We assume that all x(n − i) for i = 0, ..., 15 and all h(j) for
j = 0, ..., 7 arrive at the same time, and we want to find the sequence and allocation with the
minimum system completion time for a given set of resources. Table 2 shows the sequencing
and allocation results, where our heuristic algorithm always found the optimum solution for
this example. In the following steps, we only show the intermediate format of our design
procedure for a design with 2 MULs and 2 ADDs. The sequencing and allocation result
of the 2-MUL, 2-ADD design is as follows, where the three columns give the computation
starting time, the completion time, and the operator released time.


    Operator    Operation    Start    Completion    Release
    ADD1:       add2:        (  0.0     16.0          34.7)
                add3:        ( 34.7     50.7          69.4)
                add5:        ( 69.4     85.4         104.1)
                add6:        (104.1    120.1         138.8)
                add7:        (138.8    154.8         173.5)
                add8:        (173.5    189.5         208.2)
                adde:        (208.2    224.2         242.9)
                addg:        (242.9    258.9         277.6)
    ADD2:       add1:        (  0.0     16.0          34.7)
                add4:        ( 34.7     50.7          69.4)
                adda:        ( 69.4     85.4         104.1)
                addb:        (107.7    123.7         142.4)
                addc:        (142.4    158.4         177.1)
                addd:        (177.1    193.1         211.8)
                addf:        (224.2    240.2         258.9)
    MUL1:       mul2:        ( 16.0     52.5          71.2)
                mul3:        ( 71.2    107.7         126.4)
                mul5:        (126.4    162.9         181.6)
                mul7:        (181.6    218.1         236.8)
    MUL2:       mul1:        ( 16.0     52.5          71.2)
                mul4:        ( 71.2    107.7         126.4)
                mul6:        (126.4    162.9         181.6)
                mul8:        (189.5    226.0         244.7)

The corresponding Gantt chart of the above result is shown in Figure 51.
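As a check on how the three columns relate, the first ADD and MUL entries can be reproduced from the parameters in Table 1. The decomposition below assumes, as the listed numbers suggest, that the release time adds both the operator's own backward delay and the sharing overhead to the completion time; the symbol BP_ov for the 14.3 nsec overhead follows the notation of Section 6.3.2.

$$
\begin{aligned}
T_{cmp}(\text{add2}) &= T_{start} + D_{fp}(\text{ADD}) = 0.0 + 16.0 = 16.0, \\
T_{release}(\text{add2}) &= T_{cmp} + D_{bp}(\text{ADD}) + BP_{ov} = 16.0 + 4.4 + 14.3 = 34.7, \\
T_{cmp}(\text{mul2}) &= T_{start} + D_{fp}(\text{MUL}) = 16.0 + 36.5 = 52.5, \\
T_{release}(\text{mul2}) &= T_{cmp} + D_{bp}(\text{MUL}) + BP_{ov} = 52.5 + 4.4 + 14.3 = 71.2.
\end{aligned}
$$

Each operator is therefore occupied for 18.7 nsec beyond the completion of every operation it executes.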


Figure 51: The Gantt chart for the 2-MUL, 2-ADD FIR filter. (Legend: FP(ADD) = 16.0 nsec; FP(MUL) = 36.5 nsec; BP(ADD/MUL) = 4.4 nsec; BP(overhead) = 14.3 nsec; Data Dependency.)

Sharing schemes and local transformations  The next step is to apply the sequencing
and allocation result to the input DFG using the sharing scheme with fixed allocation and
fixed sequence. Figure 52 is the mapped DFG corresponding to the synthesis results in
Figure 51, where net labeling is used to represent data path connectivity. For example, add1
is connected to the output of J1 and to the right input of JM1 in Figure 50, and these two
paths are labeled J1o and JM1r in Figure 52. Referring to Figure 51, add1 is scheduled as the
first operation at operator ADD2, so J1o and JM1r are linked to the corresponding paths of
the sharing structure for ADD2 in Figure 52, i.e., input port 0 of the MSelector and output
port 0 of the MDistributor for ADD2. After applying the sharing scheme for the sequencing
and allocation result, we look for possible reductions of the mapped DFG. In this
design, there are two operands for each operation. In order to apply the same principle as
the local transformation in Figure 37, we need to have corresponding inputs and outputs of
two operations from the same sources and destinations. For example, (Jbl, Jbr) for addb and
(Jdl, Jdr) for addd come from the same sources, the MDistributor of MUL1 for both left operands
and the MDistributor of ADD2 for both right operands, and they have the same destination, the
MSelector of ADD2, through Jbo and Jdo in Figure 52. Therefore, we merge these two sets
to (Jbdl, Jbdr) and Jbdo. Similarly, (Jel, Jer) and Jeo for adde and (Jgl, Jgr) and Jgo for
addg can be merged to (Jegl, Jegr) and Jego. Figure 52 is thus reduced to Figure 53. The
control part of the sharing structure depends on how the inputs of the MSelector and the outputs
of the MDistributors are routed. For example, in Figure 52 the control parts CFfs4, CFfs7,
and CFfs8 generate the sequence (0,1,2,3), the sequence (0,1,2,3,4,5,6), and the sequence
(0,1,2,3,4,5,6,7), respectively, repeatedly for both MSelectors and MDistributors. In Figure
53, after local transformations, the control part CFfs4' generates the sequence (0,1,2,3)
repeatedly for the MSelector and the sequence (0,1,1,2) repeatedly for the MDistributor; the
control part CFfs4'' generates the sequence (0,1,2,3) repeatedly for the MSelector and the
sequence (0,1,2,2) repeatedly for the MDistributor. Similarly, we can find the corresponding
sequences generated by CFfs8' and CFfs7'.

Figure 52: Mapped DFG description for the 2-MUL 2-ADD FIR design.

Register minimization in EDFG  After applying the sequencing and allocation result
in Figure 53 to the original DFG description in Figure 48, we transformed the sequenced
DFG description into an EDFG description. Then we removed unnecessary registers from the
mapped EDFG description. Figures 54 and 55 are the final EDFG description after manual
register minimization, where the EDFG descriptions of mem_H and CFfs8_1 are shown in
Figure 56. In the final EDFG description, COUNT4, COUNT7, and COUNT8 are used to
produce the sequences (0,1,2,3), (0,1,2,...,6), and (0,1,2,...,7), respectively, and Dec_itj_k is
a decoder function which converts an i-bit control signal to a j-bit control signal, with index k
to distinguish different decoders.

Figure 55: EDFG description for the 2-MUL 2-ADD FIR design (Part II).

Figure 56: EDFG description for mem_H and CFfs8_1. (a) EDFG description of mem_H; (b) EDFG description of CFfs8_1.

The last step is to map the EDFG description into an RTL netlist for layout generation.
The implementation results will be shown in Section 7.3.

7.3  Experimental  Results 

We have implemented the example designs discussed in the preceding sections using a library
of asynchronous building blocks [32] composed with an industrial standard cell library, HP
C34100 [34], in a commercial CAD tool, Cadence Design Framework II. Both the RTL
netlist and the layout of these designs were produced. The performance of each design has
been simulated in a mixed-mode simulator, Verilog-XL, using the model distributed with
the cell library, plus extracted wiring capacitances. The implementation and the simulation
of these designs at the layout level show the feasibility of our design method. To show
the effectiveness of the data flow model, we further compare the area/performance of these
designs obtained from our DFG/EDFG model with the area/performance obtained from the
final layouts.

To use the DFG/EDFG model, we need to know all the timing parameters, such as D,//,
Dabi, Dap, Dfp, and Dbp, during the course of the synthesis procedure. Since the asynchronous
building blocks have been designed, these timing parameters are obtained by simulation.
Table 3 and Table 5 are the EDFG timing parameters for the multiplier design and for the FIR
design, respectively. The fanout/loading capacitance internal to each block is considered, but
the fanout/loading capacitance external to the block, which depends on the interconnection
of the block in a real design, is not considered currently. In Table 5, several modules, e.g.,
MSelector_4, have more than one set of timing parameters, which represent more than one
implementation for the same EDFG construct. The slow module is used for non-critical nodes
of the FIR filter design. We use "DFGsim" to label the performance measure of the design
from the simulation of the DFG/EDFG model. At the DFG/EDFG level, the area measure
of a design is the area sum of all asynchronous blocks used in the design, and the area of
each block is the area sum of all standard cells which implement the block. Therefore, no
wiring area is considered at the DFG/EDFG level. We use "Cell" to label the area measure.

The performance for a real layout is obtained by the simulation of the standard cell netlist

[Footnote] Design Framework II is a design framework, and it is a trademark of Cadence Design Systems, Inc. Cell
Ensemble is a standard cell placement and routing tool used in our experiments for the layout generation,
and it is a trademark of Cadence Design Systems, Inc. Verilog-XL is a mixed-mode simulator, and it is a
registered trademark of Cadence Design Systems, Inc. DRACULA is an IC layout verification system, and
it is a registered trademark of Cadence Design Systems, Inc.

Construct Name             Dfp (nsec)   Dbp (nsec)   Dsfi (nsec)   Dsbi (nsec)   Dsp (nsec)
(Ph) f1                    0.00         0.00         --            --            --
(Ph) f2                    0.00         0.00         --            --            --
(Ph) ASH                   14.70        0.00         --            --            --
(Ph) COUNTx                1.70         0.00         --            --            --
(Ph) Dec-itjJ:             2.10         0.00         --            --            --
(Ph) MSelector_2           6.03         4.85         --            --            --
(Ph) MDistributor_2        9.19         1.53         --            --            --
(Ph) MFork_3               0.00         3.10         --            --            --
Storage((16b,16b,16b))     --           --           3.80          4.80          0
Storage(nb), for n < 4     --           --           2.73          3.77          0

Table 3: Timing parameters for the EDFG in the experiment of the multiplier design.


Number of        Completion time (nsec)      Pipeline period (nsec)      Area (x10^6 μ^2)
shared units     Csim     DFGsim   Ratio     Csim     DFGsim   Ratio     Core     Cell     Ratio
16               311.14   299.80   0.961     24.25    23.30    0.964     12.377   7.236    0.585
8                418.66   397.56   0.950     65.38    60.40    0.924     16.302   9.857    0.605
4                475.61   442.28   0.930     131.51   120.80   0.919     8.532    5.183    0.607
2 (pipeline)     504.10   464.64   0.922     263.80   241.60   0.916     4.231    2.605    0.616
1                519.70   476.02   0.916     528.82   483.20   0.914     2.124    1.371    0.645
2 (non-pipe)     371.82   348.32   0.937     381.52   359.20   0.941     2.771    1.671    0.603

Table 4: Experimental results of the multiplier design.

with wiring capacitances derived from a parasitic extraction tool, DRACULA, and it is
labeled “Csim”. The area for a real layout is obtained by multiplying the width and
the height of the layout. The area measured for the core size of the final layout is labeled
“Core”.

Table 4 gives the experimental results of the 16-bit multiplier. Figures 57 and 58 show
two layouts for the multiplier implementation. We found that the performance estimated
from the simulation of the DFG/EDFG model is 91.6% to 96.4% of the performance
measured on the final layout. The experimental results also show that the cell area, which is
used as the area measure in the DFG/EDFG model, occupies approximately 58.5% to
64.5% of the final layout.

Table 6 gives the experimental results of the 16-point 16-bit FIR filter. Figure 59 shows
one layout for the FIR filter implementation. Again, we found that the performance
estimated from the simulation of the DFG/EDFG model is 87.6% to 98.1% of the performance
measured on the final layout. This experimental result also shows that the cell area
obtained from the DFG/EDFG model occupies approximately 56.2% to 59.4% of the final

Construct Name             Dfp (nsec)   Dbp (nsec)   Dsfi (nsec)   Dsbi (nsec)   Dsp (nsec)
(Ph) f3                    0.00         0.00         --            --            --
(Ph)C(’bl)                 0.00         0.00         --            --            --
(Ph) ADD                   11.60        0.00         --            --            --
(Ph) MUL                   32.10        0.00         --            --            --
(Ph) COUNTx                1.70         0.00         --            --            --
(Ph) Dec.itjJ:             2.10         0.00         --            --            --
(Ph) MSelector_2           8.87         6.28         --            --            --
(Ph) MSelector_3           8.92         5.91         --            --            --
(Ph) MSelector_4 #1        5.29         6.13         --            --            --
(Ph) MSelector_4 #2        8.92         6.59         --            --            --
(Ph) MSelector_5           6.77         6.90         --            --            --
(Ph) MSelector_6           7.08         6.50         --            --            --
(Ph) MSelector_7           6.87         6.72         --            --            --
(Ph) MSelector_8           10.26        7.64         --            --            --
(Ph) MSelector_10          6.51         7.77         --            --            --
(Ph) MDistributor^         9.19         1.53         --            --            --
(Ph) MDistributorJJ #1     5.18         2.68         --            --            --
(Ph) MDistributorJI #2     8.72         2.84         --            --            --
(Ph) MDistributor^         5.96         3.72         --            --            --
(Ph) MDistributorJS #1     6.30         3.75         --            --            --
(Ph) MDistributorJS #2     11.00        3.92         --            --            --
(Ph) MDistributor_10       6.96         5.66         --            --            --
(Ph) MFork_2               0.00         3.10         --            --            --
(Ph) MFork_3               0.00         3.10         --            --            --
(Ph) MJoin_2               3.10         0.00         --            --            --
Storage((8b,8b))           --           --           2.80          3.83          0
Storage((16b,16b))         --           --           2.80          3.83          0
Storage(8b) #1             --           --           2.80          3.83          0
Storage(8b) #2             --           --           4.25          5.27          0
Storage(16b) #1            --           --           2.80          3.83          0
Storage(16b) #2            --           --           4.25          5.27          0
Storage(nb), for n < 3     --           --           2.73          3.77          0

Table 5: Timing parameters for the EDFG in the experiment of the FIR design.


Number of        Completion time (nsec)      Pipeline period (nsec)      Area (x10^6 μ^2)
mult.   adder    Csim     DFGsim   Ratio     Csim     DFGsim   Ratio     Core     Cell     Ratio
1       1        615.84   539.40   0.876     627.20   552.41   0.881     16.434   9.583    0.583
2       2        298.53   273.85   0.917     307.32   283.90   0.924     20.923   12.419   0.594
3       3        210.40   202.68   0.963     217.01   212.88   0.981     23.707   13.318   0.562

Table 6: Experimental results of the FIR design.
layout.


The Csim (actual extracted) value is larger than the DFGsim (estimated) value for each
design, for the following reasons.

1. The operation fanout and the control fanout external to the basic blocks are not
considered in the current model.

2. The wiring delays between modules are also not considered in the current model; this
delay cannot be accurately estimated until the actual layout is generated.

3. Some extra buffers were needed between some modules to comply with a CAD tool
limitation in the current implementation, and these extra delays were not known during
synthesis and analysis of the design at the DFG/EDFG level.

Despite  these  factors,  our  high-level  timing  model  is  quite  accurate;  the  DFGsim/Csim  ratio 
is  98.1%  for  the  best  case  and  is  87.6%  for  the  worst  case  in  all  our  experimental  results. 
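
The best-case and worst-case figures quoted above can be re-derived from the tabulated measurements. The sketch below assumes the (Csim, DFGsim) pairs are transcribed correctly from Tables 4 and 6; the variable and function names are illustrative only.

```python
# (Csim, DFGsim) pairs in nsec, transcribed from Tables 4 and 6.
MULTIPLIER_COMPLETION = [(311.14, 299.80), (418.66, 397.56), (475.61, 442.28),
                         (504.10, 464.64), (519.70, 476.02), (371.82, 348.32)]
MULTIPLIER_PERIOD = [(24.25, 23.30), (65.38, 60.40), (131.51, 120.80),
                     (263.80, 241.60), (528.82, 483.20), (381.52, 359.20)]
FIR_COMPLETION = [(615.84, 539.40), (298.53, 273.85), (210.40, 202.68)]
FIR_PERIOD = [(627.20, 552.41), (307.32, 283.90), (217.01, 212.88)]


def accuracy_ratios(pairs):
    """Return DFGsim/Csim for every (Csim, DFGsim) pair."""
    return [dfgsim / csim for csim, dfgsim in pairs]


all_pairs = (MULTIPLIER_COMPLETION + MULTIPLIER_PERIOD
             + FIR_COMPLETION + FIR_PERIOD)
all_ratios = accuracy_ratios(all_pairs)

worst = min(all_ratios)  # largest underestimate of the layout delay
best = max(all_ratios)   # closest estimate
print(f"worst {worst:.3f}, best {best:.3f}")  # prints "worst 0.876, best 0.981"
```

Every ratio is below 1, which reflects the observation that the data flow level estimate is always optimistic relative to the extracted layout.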

It is common to use the cell area to estimate the routing overhead before placement
and routing [37]. In the two examples above, the Cell/Core ratio lies between 56.2% and 64.5%.
Although the area ratios for these designs are not fixed, they vary only within a
small range. Therefore, the cell area is a sufficient area measure in the data flow model
for high-level synthesis algorithms.
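
As a rough illustration of how a narrow Cell/Core range makes the cell area usable as a proxy, the sketch below turns a cell area into an interval for the expected core area. The function name and the use of the observed 56.2% to 64.5% range as default bounds are assumptions made for this example; the report itself does not prescribe such a procedure.

```python
def core_area_bounds(cell_area, ratio_min=0.562, ratio_max=0.645):
    """Estimate (low, high) bounds on core area from the cell area.

    ratio_min and ratio_max are the smallest and largest Cell/Core ratios
    observed in the experiments; a larger ratio implies a smaller core.
    """
    if cell_area < 0:
        raise ValueError("cell_area must be non-negative")
    return cell_area / ratio_max, cell_area / ratio_min


# Cell area of the 16-unit multiplier from Table 4 (same units as the table).
low, high = core_area_bounds(7.236)
print(f"estimated core area between {low:.2f} and {high:.2f}")
```

For this design the measured core area in Table 4 is 12.377, which falls inside the estimated interval. Since the ratio range was itself derived from these designs, the check only illustrates the arithmetic and is not an independent validation.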

Since we have an accurate timing model and a proper area measure at the data
flow level, the synthesis results represent the design space properly. One of the main
reasons that the synthesis algorithms explore the design space properly is that, in our system,
the area/performance overhead of resource sharing, such as control units and
multiplexers/demultiplexers, is accurately reflected and predicted. This allows us to estimate both
the performance and the area at the data flow level quite accurately.

8  Conclusion 

In  this  report,  we  have  presented  a  design  method  for  asynchronous  systems  based  on  the 
data  flow  specification.  An  asynchronous  system  is  seen  as  a  set  of  communicating  processes, 
and  the  token  data  flow  model  is  used  to  describe  the  behavior  of  the  system.  This  design 
method  not  only  provides  a  high-level  description  language,  but  also  provides  a  systematic 
transformation  within  the  data  flow  model  to  support  high-level  synthesis  such  as  sequencing 
and  allocation,  sharing  schemes,  local  transformations,  and  register  minimization.  Finally, 
the  data  flow  specification  is  transformed  into  an  EDFG  description  which  is  sufficient  for 
layout  realization. 

In order to make the synthesis result useful, we have derived a timing model for the data
flow specification so that the synthesis algorithms have an accurate and practical model.
Experimental results show the effectiveness of using the timed data flow specification to
design, analyze, and synthesize asynchronous systems. Experimental results also show that
the performance estimated by our timing model at the data flow level is within about 90% of
that of the actual implementation (87.6% in the worst case).

Many other issues regarding the design of basic asynchronous blocks and the hardware
implementation have not been covered in this report, but they play an important role in
realizing and demonstrating this design method. The details of the design of the basic
asynchronous blocks can be found in [32].

Figure 57: Layout for the 2-ASH pipelined multiplier design.

Figure 58: Layout for the 2-ASH non-pipelined multiplier design.

Figure 59: Layout for the 2-ADD 2-MUL FIR design.